Abstract

The ongoing rapid expansion of the Internet greatly increases the necessity of effective
recommender systems for filtering the abundant information. Extensive research on
recommender systems is conducted by a broad range of communities including social and
computer scientists, physicists, and interdisciplinary researchers. Despite substantial
theoretical and practical achievements, unification and comparison of different approaches
are lacking, which impedes further advances. In this article, we review recent developments
in recommender systems and discuss the major challenges. We compare and evaluate available
algorithms and examine their roles in future developments. In addition to algorithms,
physical aspects are described to illustrate the macroscopic behavior of recommender systems.
Potential impacts and future directions are discussed. We emphasize that recommendation
has great scientific depth and combines diverse research fields, which makes it of interest
to physicists as well as interdisciplinary researchers.
1. Introduction

Thanks to computers and computer networks, our society is undergoing rapid transformation
in almost all aspects. We buy online, gather information through search engines and live
a significant part of our social life on the Internet. The fact that many of our actions and
interactions are nowadays stored electronically gives researchers the opportunity to study
socio-economic and techno-social systems at a much better level of detail. Traditional "soft
sciences", such as sociology or economics, have fast-growing branches relying on the
study of these newly available massive data sets [1, 2]. Physicists, with their long
experience with data-driven research, have joined this trend and contributed to many fields
such as finance [3, 4], network theory [5, 6, 7, 8, 9] and social dynamics [10] which are
outside their traditional realm. The study of recommender systems and information filtering
in general is no exception, with the interest of physicists steadily increasing over the past
decade. The task of recommender systems is to turn data on users and their preferences
into predictions of users' possible future likes and interests. The study of recommender
systems is at the crossroads of science and socio-economic life and its huge potential was
first noticed by web entrepreneurs at the forefront of the information revolution. While
originally a field dominated by computer scientists, recommendation calls for contributions
from various directions and is now a topic of interest also for mathematicians, physicists,
and psychologists. For instance, it is not a coincidence that an approach based on what
psychologists know about human behavior scored high in a recent recommendation contest
organized by the commercial company Netflix [11].

When computing recommendations for a particular user, the very basic approach is 
to select the objects favored by other users that are similar to the target user. Even 
this simple approach can be realized in a multitude of ways — this is because the field of 
recommendation lacks general "first principles" from which one could deduce the right way 
to recommend. For example, how best to measure user similarity and assess its uncertainty? 
How to aggregate divergent opinions from various users? How to handle users for whom 
little information is available? Should all data be trusted equally or can one detect reckless 
or intentionally misleading opinions? These and similar issues arise also when methods 
more sophisticated than those based on user similarity are used. Fortunately, there exist 
a number of real data sets that can be used to measure and compare performance of 
individual methods. Consequently, as in physics, it is experiment that decides
which recommendation approach is good and which is not.

It would be very misleading to think that recommender systems are studied only because 
suitable data sets are available. While the availability of data is important for empirical 
evaluation of recommendation methods, the main driving force comes from practice: elec- 
tronic systems give us too much choice to handle by ourselves. The interest from industry 
is hardly surprising — an early book on the nascent field of recommendation, Net Worth by 
John Hagel III and Marc Singer [12] , clearly pointed out the enormous economic impact of 
"info-mediaries" who can greatly enhance individual consumers' information capabilities. 
Most e-commerce web sites now offer various forms of recommendation — ranging from sim- 
ply showing the most popular items or suggesting other products by the same producer to 
complicated data mining techniques. People soon realized that there is no unique best
recommendation method. Rather, depending on the context and density of the available data,
different methods adapted to particular applications are most likely to succeed. Hence
there is no panacea, and the best one can do is to understand the underlying premises and
recommender mechanisms; with this understanding, one can tackle many diverse application
problems arising in real life. This is also reflected in this review, where we do not try to
highlight any ultimate approach to recommendation. Instead, we review the basic ideas,
methods and tools with particular emphasis on physics-rooted approaches.

The motivation for writing this review is multifold. Firstly, while extensive reviews
of recommender systems by computer scientists already exist [13, 14, 15], the view of
physicists differs from that of computer scientists in relying more on the complex networks
approach and in adapting various classical physics processes (such as diffusion) for
information filtering. We thus believe that this review, with its structure and emphasis on
the respective topics, can provide a novel point of view. Secondly, the past decade has
already seen a growing interest of physicists in recommender systems and we hope that this
review can be a useful source for them by describing the state of the art in language that is
more familiar to the physics community. Finally, the interdisciplinary approach presented
here might provide new insights and solutions for open problems and challenges in the active
field of information filtering.

This review is organized as follows. To better motivate the problem, in Section 2 we
begin with a discussion of real applications of recommender systems. Next, in Section 3 we
introduce basic concepts (such as complex networks, recommender systems, and metrics
for their evaluation) that form a basis for all subsequent exposition. We then proceed to
a description of recommendation algorithms, where traditional approaches (similarity-based
methods in Section 4 and dimensionality reduction techniques in Section 5) are followed by
network-based approaches which have their origin in the random walk process well known
to all physicists (in Section 6). Methods based on external information, such as social
relationships (in Section 7), keywords or time stamps (in Section 8), are also included. We
conclude with a brief evaluation of methods' performance in Section 9 and a discussion on
the outlook of the field in Section 10.

2. Real Applications of Recommender Systems 

Thanks to the ever-decreasing costs of data storage and processing, recommender systems
have gradually spread to most areas of our lives. Sellers carefully watch our purchases to
recommend other goods to us and enhance their sales, social web sites analyze our contacts
to help us connect with new friends and get hooked on the site, and online radio stations
remember skipped songs to serve us better in the future (see more examples in Table 1).
In general, whenever there is plenty of diverse products and customers are not alike,
personalized recommendation may help to deliver the right content to the right person. This
is particularly the case for those Internet-based companies that try to make use of the
so-called long tail [16] of goods which are rarely purchased yet, due to their multitude,
can yield considerable profits (sometimes they are referred to as "worst-sellers"). For
example, on Amazon between 20 and 40 percent of sales are due to products that do not belong
to the shop's 100,000 best-selling products [17]. A recommender system may hence have a
significant impact on a company's revenues: for example, 60% of DVDs rented by Netflix
are selected based on personalized recommendations.¹

As discussed in [18], recommender systems not only help decide which products to 
offer to an individual customer, they also increase cross-sell by suggesting additional prod- 
ucts to the customers and improve consumer loyalty because consumers tend to return to 
the sites that best serve their needs (see [19] for an empirical analysis of the impact of
recommendations and consumer feedback on sales at Amazon.com). 

Since no recommendation method serves all customers best, major sites are usually
equipped with several distinct recommendation techniques ranging from simple
popularity-based recommendations to sophisticated techniques, many of which we shall
encounter in the following sections. Further, new companies emerge (see, for example,
string.com) which aim at collecting all sorts of user behavior (ranging from pages visited
on the web and music listened to on a personal player to "liking" or purchasing items) and
using it to provide personalized recommendations of different goods or services.

2.1. Netflix Prize 

In October 2006, the online DVD rental company Netflix released a dataset containing
approximately 100 million anonymous movie ratings and challenged researchers and
practitioners to develop recommender systems that could beat the accuracy of the company's
recommendation system, Cinematch [20]. Although the released data set represented only
a small fraction of the company's rating data, thanks to its size and quality it quickly
became a standard in the data mining and machine learning community. The data set contained
ratings on an integer scale from 1 to 5 which were accompanied by dates. For each movie,
the title and year of release were provided. No information about users was given. Submitted
predictions were evaluated by their root mean squared error (RMSE) on a qualifying data
set containing over 2,817,131 unknown ratings. Out of 20,000 registered teams, 2,000 teams
submitted at least one answer set. On 21 September 2009, the grand prize of $1,000,000
was awarded to a team that outperformed Cinematch's accuracy by 10%. At the
time when the contest was closed, there were two teams that achieved the same precision.
The prize was awarded to the team that submitted their results 20 minutes earlier than
the other one. (See [11] for a popular account of how the participants struggled with the
challenge.)
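
The evaluation criterion used in the contest can be illustrated with a minimal sketch. The function name rmse and the rating values below are hypothetical and chosen only for illustration; they are not taken from the Netflix data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings on the 1-5 scale (illustrative values only).
actual = [4, 3, 5, 1]
predicted = [3.5, 3.0, 4.0, 2.0]
print(rmse(predicted, actual))
```

Because the errors are squared before averaging, a few large mistakes weigh more heavily than many small ones.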

There are several lessons that we have learned from this competition [21]. Firstly, the
company gained publicity and a superior recommendation system that is supposed to
improve user satisfaction. Secondly, ensemble methods showed their potential for improving
the accuracy of predictions.² Thirdly, we saw that accuracy improvements are increasingly



1 As presented by Jon Sanders (Recommendation Systems Engineering, Netflix) during the talk
"Research Challenges in Recommenders" at the 3rd ACM Conference on Recommender Systems (2009).

2 The ensemble methods deal with the selection and organization of many individual algorithms to 
achieve better prediction accuracy. In fact, the winning team, called BellKor's Pragmatic Chaos, was a 
Site            What is recommended
--------------  --------------------
Amazon          books/other products
Facebook        friends
WeFollow        friends
MovieLens       movies
Nanocrowd       movies
Jinni           movies
Findory         news
Digg            news
Zite            news
Meehive         news
Netflix         DVDs
CDNOW           CDs/DVDs
eHarmony        dates
Chemistry       dates
True.com        dates
Perfectmatch    dates
CareerBuilder   jobs
Monster         jobs
Pandora         music
Mufin           music
StumbleUpon     web sites

Table 1: Popular sites using recommender systems. In addition, some companies devote
themselves to recommendation techniques, such as Baifendian (www.baifendian.com), Baynote
(www.baynote.com), ChoiceStream (www.choicestream.com), Goodrec (www.goodrec.com), and others.
demanding when RMSE drops below a certain level. Finally, despite the company's effort,
the anonymity of its users was not sufficiently ensured [25]. As a result, Netflix was sued
by one of its users and decided to cancel a planned second competition.

2.2. Major challenges 

Researchers in the field of recommender systems face several challenges that threaten
the use and performance of their algorithms. Here we mention only the major
ones:

1. Data sparsity. Since the pool of available items is often exceedingly large (major
online bookstores offer several million books, for example), the overlap between two users
is often very small or nonexistent. Further, even when the average number of evaluations
per user/item is high, the evaluations are distributed among the users/items very unevenly
(usually they follow a power-law distribution [26]) and hence the majority of users/items
may have expressed/received only a few ratings. Hence, an effective recommender
algorithm must take the data sparsity into account [27].

2. Scalability. While the data is mostly sparse, for major sites it includes millions of
users and items. It is therefore essential to consider computational cost and to search
for recommender algorithms that are either computationally undemanding or easy to
parallelize (or both). Another possible solution is based on using incremental versions
of the algorithms where, as the data grows, recommendations are not recomputed
globally (using the whole data) but incrementally (by slightly adjusting previous
recommendations according to the newly arrived data) [28, 29]. This incremental
approach is similar to perturbation techniques that are widely used in physics and
mathematics [30].

3. Cold start. When new users enter the system, there is usually insufficient information
to produce recommendations for them. The usual solutions to this problem are based
on using hybrid recommender techniques (see Section 8.4) combining content and
collaborative data [31, 32], and sometimes they are accompanied by asking the users for
some basic information (such as age, location and preferred genres). Another
way is to identify individual users across different web services. For example, Baifendian
developed a technique that can track individual users' activities on several
e-commerce sites, so that for a cold-start user on site A, one can make recommendations
according to her records on sites B, C, D, etc.
4. Diversity vs. accuracy. When the task is to recommend items which are likely to be 
appreciated by a particular user, it is usually most effective to recommend popular 
and highly rated items. Such recommendation, however, has very little value for the 
users because popular objects are easy to find (often they are even hard to avoid) 
without a recommender system. A good list of recommended items hence should 



combined team of BellKor [22] , Pragmatic Theory [53] and BigChaos [23] (of course, it was not a simple 
combination but a sophisticated design), and each of them consists of many individual algorithms. For 
example, the Pragmatic Theory solution considered 453 individual algorithms. 
contain also less obvious items that are unlikely to be reached by the users themselves
[33]. Approaches to this problem include direct enhancement of the recommendation
list's diversity [34, 35, 36] and the use of hybrid recommendation methods [37].

5. Vulnerability to attacks. Due to their importance in e-commerce applications,
recommender systems are likely targets of malicious attacks trying to unjustly promote
or inhibit some items [38]. There is a wide range of tools preventing this kind of
behavior, ranging from blocking the malicious evaluations from entering the system to
sophisticated resistant recommendation techniques [39]. However, this is not an easy
task since the strategies of attackers become more and more advanced as the prevention
tools develop. As an example, Burke et al. [40] introduced eight attack
strategies, which are further divided into four classes: basic attack, low-knowledge
attack, nuke attack and informed attack.

6. The value of time. While real users have interests with widely diverse time scales (for
example, short-term interests related to a planned trip and long-term interests related
to the place of living or political preferences), most recommendation algorithms
neglect the time stamps of evaluations. It is an ongoing line of research whether and
how the value of old opinions should decay with time and what the typical temporal
patterns in user evaluations and item relevance are [41, 42].

7. Evaluation of recommendations. While we have plenty of distinct metrics (see Section
3.4), how to choose the ones best corresponding to the given situation and task is still
an open question. Comparisons of different recommender algorithms are also
problematic because different algorithms may simply solve different tasks. Finally, the
overall user experience with a given recommendation system, which includes the user's
satisfaction with the recommendations and the user's trust in the system, is difficult to
measure in "offline" evaluation. Empirical user studies thus still represent a welcome
source of feedback on recommender systems.
8. User interface. It has been shown that to facilitate users' acceptance of
recommendations, the recommendations need to be transparent [43, 44]: users appreciate when
it is clear why a particular item has been recommended to them. Another issue is
that since the list of potentially interesting items may be very long, it needs to be
presented in a simple way and it should be easy to navigate through it to browse
different recommendations which are often obtained by distinct approaches.
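
The data sparsity challenge listed above can be made concrete with a small sketch. The user names, item names and helper functions (density, overlap) are hypothetical and serve only to show how few user-item pairs are typically filled and how small the overlap between two users can be.

```python
# Hypothetical user -> set of collected items (illustrative data only).
collections = {
    "alice": {"book1", "book2", "book3"},
    "bob": {"book3", "book4"},
    "carol": {"book5"},
}

def density(collections):
    """Fraction of all possible user-item pairs that are actually present."""
    items = set().union(*collections.values())
    links = sum(len(owned) for owned in collections.values())
    return links / (len(collections) * len(items))

def overlap(collections, u, v):
    """Number of items that users u and v have in common."""
    return len(collections[u] & collections[v])

print(density(collections))                    # 6 links out of 3 * 5 possible pairs
print(overlap(collections, "alice", "bob"))    # one shared item
print(overlap(collections, "alice", "carol"))  # no shared item
```

In real systems the density is usually orders of magnitude smaller than in this toy example, so many user pairs have no common items at all and similarity estimates based on overlap become unreliable.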

Besides the above long-standing challenges, many novel issues have appeared recently. Thanks
to the development of methodology in related branches of science, especially the new tools
in network analysis, scientists started to consider the effects of network structure on
recommendation and how to make use of known structural features to improve recommendation.
For example, Huang et al. [45] analyzed consumer-product networks and proposed
an improved recommendation algorithm preferring edges that enhance the local clustering
property, and Sahebi et al. [46] designed an improved algorithm making use of the
community structure. Progress and propagation of new techniques also bring new challenges.
For example, GPS-equipped mobile phones have become mainstream and Internet access is
ubiquitous, hence location-based recommendation is now feasible and increasingly
significant.³ Accurate recommendation asks for both high predictability
of human movements [47, 48] and a quantitative way to define similarities between locations
and people [49, 50]. Lastly, intelligent recommender systems should take into account the
different behavioral patterns of different people. For example, new users tend to visit very
popular items and select similar items, while old users usually have more specific interests
[51, 52], and users behave very differently in low-risk (e.g., collecting bookmarks,
downloading music, etc.) and high-risk (e.g., buying a computer, renting a house, etc.)
activities [53, 54].

3. Definitions of Subjects and Problems 

We briefly review in this section basic concepts that are useful in the study of
recommender systems.

3.1. Networks

Network analysis is a versatile tool for uncovering the organization principles of many
complex systems [5, 6, 7, 8, 9]. A network is a set of elements (called nodes or vertices) with
connections (called edges or links) between them. Many social, biological, technological
and information systems can be described as networks with nodes representing individuals
or organizations and edges capturing their interactions. The study of networks, referred to
as graph theory in the mathematical literature, has a long history that begins with the
classical Königsberg bridge problem solved by Euler in the 18th century [55]. Mathematically
speaking, a network G is an ordered pair of disjoint sets (V, E) where V is the set of nodes
and the set of edges, E, is a subset of V × V [56]. In an undirected network, an edge joining
nodes x and y is denoted by x ↔ y, and x ↔ y and y ↔ x mean exactly the same edge. In a
directed network, edges are ordered pairs of nodes and an edge from x to y is denoted by
x → y. Edges x → y and y → x are distinct and may be present simultaneously. Unless
stated otherwise, we assume that a network does not contain a self-loop (an "edge" joining
a node to itself) or a multi-edge (several "edges" joining the same pair of nodes). In a
multinetwork both loops and multi-edges are allowed.

In an undirected network G(V, E), two nodes x and y are said to be adjacent to each
other if x ↔ y ∈ E. The set of nodes adjacent to a node x, the neighborhood of x, is
denoted by Γ_x. The degree of node x is defined as k_x = |Γ_x|. The degree distribution, P(k),
is defined as the probability that a randomly selected node is of degree k. In a regular
network, every node has the same degree k_0 and thus P(k) = δ_{k,k_0}. In the classical
Erdős-Rényi random network [57] where each pair of nodes is connected by an edge with a given
probability p, the degree distribution follows a binomial form [58]

P(k) = \binom{N-1}{k} p^k (1-p)^{N-1-k},    (1)



3 Websites like Foursquare, Gowalla, Google Latitude, Facebook, Jiapang, and others already provide 
location-based services and show that many people want to share their location information and get 
location-based recommendations. 
Figure 1: A simple undirected network with 6 nodes and 7 edges. The node degrees are k_1 = 1, k_2 = k_5 = 3
and k_3 = k_4 = k_6 = 2, corresponding to the distribution P(1) = 1/6, P(2) = 1/2 and P(3) = 1/3. The
diameter and average distance of this network are d_max = 3 and d̄ = 1.6, respectively. The clustering
coefficients are c_2 = 1/6, c_3 = c_4 = 0, c_5 = 1/3 and c_6 = 1, and the average clustering coefficient is C = 0.3.



where N = |V| is the number of nodes in the network. This distribution has a characteristic
scale represented by the average degree k̄ = p(N − 1). At the end of the last century,
researchers turned to the investigation of large-scale real networks, where it turned out that
their degree distributions often span several orders of magnitude and approximately follow
a power-law form

P(k) \sim k^{-\gamma},    (2)

with γ being a positive exponent usually lying between 2 and 3 [5]. Such networks are called
scale-free networks as they lack a characteristic scale of degree and the power-law function
P(k) is scale-invariant [59]. Note that detection of power-law distributions in empirical
data requires solid statistical tools [60, 61]. For a directed network, the out-degree of a
node x, denoted by k_x^out, is the number of edges starting at x, and the in-degree k_x^in is
the number of edges ending at x. The in- and out-degree distributions of a directed network
in general differ from each other.
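
As a minimal sketch of the definitions above, the following code computes the empirical degree distribution of a small undirected network and evaluates the binomial degree distribution of an Erdős-Rényi network. The example network, the function names and the parameter values (50 nodes, p = 0.1) are assumptions made for illustration only.

```python
from collections import Counter
from math import comb

def degree_distribution(edges, nodes):
    """Return {k: P(k)} for an undirected simple network given as an edge list."""
    degree = {x: 0 for x in nodes}
    for x, y in edges:
        degree[x] += 1
        degree[y] += 1
    counts = Counter(degree.values())
    return {k: c / len(nodes) for k, c in sorted(counts.items())}

def binomial_degree_probability(k, n_nodes, p):
    """Binomial P(k) for an Erdős-Rényi network with n_nodes nodes."""
    return comb(n_nodes - 1, k) * p**k * (1 - p) ** (n_nodes - 1 - k)

# Hypothetical 5-node network: a triangle 1-2-3 with a tail 3-4-5.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
print(degree_distribution(edges, nodes))

# The binomial probabilities sum to one and their mean equals p(N - 1).
n_nodes, p = 50, 0.1
probs = [binomial_degree_probability(k, n_nodes, p) for k in range(n_nodes)]
mean_degree = sum(k * q for k, q in enumerate(probs))
print(sum(probs), mean_degree)
```

The second part checks numerically that the binomial distribution is normalized and that its mean is the characteristic scale p(N − 1) mentioned in the text.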

Generally speaking, a network is said to be assortative if its high-degree nodes tend
to connect with high-degree nodes and its low-degree nodes tend to connect with
low-degree nodes (it is said to be disassortative if the situation is opposite). This
degree-degree correlation can be characterized by the average degree of the nearest
neighbors [62, 63] or a variant of the Pearson coefficient called the assortativity
coefficient [64, 65]. The assortativity coefficient r lies in the range −1 ≤ r ≤ 1. If r > 0,
the network is assortative; if r < 0, the network is disassortative. Note that this
coefficient is sensitive to degree heterogeneity. For example, r will be negative in a network
with a very heterogeneous degree distribution (e.g., the Internet) regardless of the network's
connecting patterns [66].

The number of edges in a path connecting two nodes is called the length of the path, and
the distance between two nodes is defined as the length of the shortest path that connects
them. The diameter of a network is the maximal distance among all node pairs and the
average distance is the mean distance averaged over all node pairs as

\bar{d} = \frac{1}{N(N-1)} \sum_{x \neq y} d_{xy},    (3)

where d_xy is the distance between x and y.⁴ Many real networks display a so-called
small-world phenomenon: their average distance does not grow faster than the logarithm of the
network size [68, 69].
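
The diameter and the average distance can be computed by running a breadth-first search from every node. The sketch below assumes a connected, unweighted, undirected network with at least two nodes, stored as a hypothetical adjacency dictionary; the function names are illustrative.

```python
from collections import deque

def distances_from(source, adjacency):
    """Breadth-first search: shortest-path lengths from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adjacency[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def diameter_and_average_distance(adjacency):
    """Return (d_max, average distance) over ordered pairs x != y.

    Assumes a connected network with at least two nodes.
    """
    n = len(adjacency)
    total, d_max = 0, 0
    for x in adjacency:
        dist = distances_from(x, adjacency)
        if len(dist) < n:
            raise ValueError("network is not connected; some distances are infinite")
        total += sum(dist.values())
        d_max = max(d_max, max(dist.values()))
    return d_max, total / (n * (n - 1))

# Hypothetical network: a path 1-2-3-4.
adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
d_max, d_avg = diameter_and_average_distance(adjacency)
print(d_max, d_avg)
```

For a disconnected network the sketch raises an error instead of returning an infinite average; excluding unreachable pairs or using the harmonic mean are the alternatives.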

The importance of triadic clustering in social interaction systems has been realized for
more than 100 years [70]. In social network analysis [71], this kind of clustering is called
transitivity, defined as three times the ratio of the total number of triangles in a network
to the total number of connected node triples. In 1998, Watts and Strogatz [69] proposed
a similar index to quantify the triadic clustering, called the clustering coefficient. For a
given node x, this coefficient is defined as the ratio of the number of existing edges between
x's neighbors to the number of neighbor pairs,

c_x = \frac{e_x}{\frac{1}{2} k_x (k_x - 1)},    (4)

where e_x denotes the number of edges between the k_x neighbors of node x (this definition is
meaningful only if k_x > 1). The network clustering coefficient is defined as the average of
c_x over all x with k_x > 1. It is also possible to define the clustering coefficient as the
ratio of 3 × the number of triangles in the network to the number of connected triples of
vertices, which is sometimes referred to as the "fraction of transitive triples" [7]. Note
that the two definitions can give substantially different results.
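
The difference between the two definitions can be seen on a small example. The sketch below assumes an undirected simple network stored as a dictionary of neighbor sets; the network (a triangle with two pendant nodes attached to one of its corners) and the function names are hypothetical.

```python
from itertools import combinations

def local_clustering(adjacency):
    """c_x = e_x / [k_x (k_x - 1) / 2] for every node with degree k_x > 1."""
    result = {}
    for x, neighbors in adjacency.items():
        k = len(neighbors)
        if k > 1:
            e_x = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
            result[x] = e_x / (k * (k - 1) / 2)
    return result

def average_clustering(adjacency):
    """Average of c_x over all nodes with degree larger than one."""
    c = local_clustering(adjacency)
    return sum(c.values()) / len(c)

def transitivity(adjacency):
    """3 x (number of triangles) / (number of connected triples)."""
    closed = 0   # connected triples whose two end nodes are also linked
    triples = 0  # all connected triples, counted once per centre node
    for x, neighbors in adjacency.items():
        for u, v in combinations(neighbors, 2):
            triples += 1
            if v in adjacency[u]:
                closed += 1
    # Each triangle closes three triples (one per centre node), so `closed`
    # already equals 3 x the number of triangles.
    return closed / triples

# Hypothetical network: triangle 1-2-3 plus pendant nodes 4 and 5 attached to node 3.
adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3}, 5: {3}}
print(local_clustering(adjacency))
print(average_clustering(adjacency), transitivity(adjacency))
```

In this example the node-averaged coefficient is dominated by the two low-degree nodes with c_x = 1, whereas the transitivity is dominated by the many open triples centred on the high-degree node, which is why the two values differ markedly.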

Figure 1 illustrates the above definitions for a simple undirected network. For more
information about network measurements, readers are encouraged to refer to an extensive
review article [72] on the characterization of networks.

3.2. Bipartite Networks and Hypergraphs 

A network G(V, E) is a bipartite network if there exists a partition (V_1, V_2) such that
V_1 ∪ V_2 = V, V_1 ∩ V_2 = ∅, and every edge connects a node of V_1 and a node of V_2.
Many real systems are naturally modeled as bipartite networks: the metabolic network
[73] consists of chemical substances and chemical reactions, the collaboration network [74]
consists of acts and actors, the Internet telephone network consists of personal computers
and phone numbers [75], etc. We focus on a particular class of bipartite networks, called
web-based user-object networks [51], which represent interactions between users and objects
in online service sites, such as collections of bookmarks in delicious.com and purchases of
books in amazon.com. As we shall see later, these networks describe the fundamental
structure of recommender systems. Web-based user-object networks are distinguished by their



4 When no path exists between two nodes, we say that their distance is infinite, which makes
the average distance automatically infinite too. This problem can be avoided either by
excluding such node pairs from averaging or by using the harmonic mean [67].



Figure 2: An illustration of the one-to-one correspondence between a hypergraph (a) and a bipartite
network (b). There are three hyperedges, X = {1, 2, 4}, Y = {4, 5, 6} and Z = {2, 3, 5, 6}.

gradual evolution where both nodes and links are added gradually. By contrast, this cannot
happen in, for example, act-actor networks (e.g., one cannot add authors to a scientific
paper after its publication).

Most web-based user-object networks share some structural properties. Their object-degree
distributions obey a power-law-like form P(k) ∼ k^{-γ}, with γ ≈ 1.6 for the Internet
Movie Database (IMDb) [76], γ ≈ 1.8 for the music-sharing site audioscrobbler.com [77],
γ ≈ 2.3 for the e-commerce site amazon.com [51], and γ ≈ 2.5 for the bookmark-sharing
site delicious.com [51]. The form of the user-degree distribution is usually between an
exponential and a power law [51], and can be well fitted by the Weibull distribution [78]

P(k) \sim k^{\mu - 1} \exp\left[-(k/k_0)^{\mu}\right],    (5)

where k_0 is a constant and μ is the stretching exponent. Connections between users and
objects exhibit a disassortative mixing pattern [76, 51].

A straightforward extension of the definition of bipartite network is the so-called
multipartite network. For an r-partite network G(V, E), there is an r-partition V_1, V_2, ..., V_r
such that V = V_1 ∪ V_2 ∪ ... ∪ V_r, V_i ∩ V_j = ∅ whenever i ≠ j, and no edge joins two
nodes in the same set V_i for all 1 ≤ i ≤ r. The tripartite network representation has
found its application in collaborative tagging systems (also called folksonomies in the
literature) [79, 80, 81, 82], where users assign tags to online resources, such as photographs
in flickr.com, references in CiteULike.com and bookmarks in delicious.com.

Note that some information is lost in the tripartite representation. For example, given
an edge connecting a resource and a tag, we do not know which user (or users) contributed
to this edge. To resolve this, a hypergraph [83] can be used to give an exact representation
of the full structure of a collaborative tagging system. In a hypergraph H(V, E), the
hyperedge set E is a subset of the power set of V, that is, the set of all subsets of V.
A hyperedge e can therefore connect multiple nodes. Analogously to ordinary networks, node
degree in a hypergraph is defined as the number of hyperedges adjacent to a node and the
distance between two nodes is defined as the minimal number of hyperedges connecting
these nodes. The clustering coefficient [80, 84] and community structure [EH EH] can also
Figure 3: A hypergraph illustration of collaborative tagging networks. (left) A triangle-like hyperedge [5T],
which contains three types of vertices, depicted by one red circle, one green rectangle and one blue triangle
which respectively represent a user, a resource and a tag. (right) A descriptive hypergraph consisting of two
users, four resources and three tags. Taking user U_2 and resource R_1 as an example, the measurements are
as follows: (i) U_2 has participated in six hyperedges, which means its hyperdegree is 6; (ii) U_2 is directly
connected to three resources and three tags. As defined by Eq. (6), it can have at most 3 × 3 = 9
hyperedges. Thus its clustering coefficient equals 6/9 ≈ 0.667, where 6 is its hyperdegree. By comparison,
as defined by Eq. (7), its clustering coefficient equals 0.75; (iii) the shortest path
from U_2 to R_1 is U_2 - T_1 - R_1, which indicates that the distance between U_2 and R_1 is 2.



be defined and quantified following the definitions in ordinary networks. Notice that there
is a one-to-one correspondence between a hypergraph and a bipartite network. Given a
hypergraph H(V, E), the corresponding bipartite network G(V', E') contains two node sets,
as V' = V ∪ E, and x ∈ V is connected with Y ∈ E if and only if x ∈ Y (see Figure 2 for
an illustration).

Hypergraph representation has already found applications in ferromagnetic dynamics
[HSIET], population stratification [88], cellular networks [89], academic team formation [90],
and many other areas. Here we are more concerned with the hypergraph representation
of collaborative tagging systems [SH EH E2], where each hyperedge joins three nodes
(represented by a triangle-like hyperedge in Figure 3): user u, resource r and tag t, indicating that u
has given t to r. A resource can be collected by many users and given several tags by
a user, and a tag can be associated with many resources, resulting in small-world
hypergraphs [HU [92] (Figure 3 shows both the basic unit and an extended example). Moreover,
hypergraphs for collaborative tagging systems have been shown to be highly clustered,
with heavy-tailed degree distributions (see footnote 5) and community structure [HU H2]. A model for
evolving hypergraphs can be found in [92].

Footnote 5: The degrees of users, resources and tags are usually investigated separately. For flickr.com and
CiteULike.com, the user and tag degree distributions are power-law-like, while the resource degree distributions
are much narrower because in flickr.com a photograph is collected by only a single user and in
CiteULike.com a reference is rarely collected by many users [84]. By contrast, in delicious.com, a popular
bookmark can be collected by thousands of users and thus the resource degree distribution is of a
power-law kind [55].

Generally, to evaluate a hypergraph from the perspective of complexity science, the following
quantities can be applied (Figure 3 illustrates them):

(i) hyperdegree: The degree of a node in a hypergraph can be naturally defined as the
number of hyperedges adjacent to it.

(ii) hyperdegree distribution: defined as the fraction of nodes having each hyperdegree value,
where the hyperdegree is the number of hyperedges that a regular node participates
in.

(iii) clustering coefficients: defined as the ratio of the actual number of hyperedges to the maximal
possible number of hyperedges that a regular node could have (HQ]. For example, the clustering
coefficient of a user, $C_u$, is defined as

$$C_u = \frac{k_u}{R_u T_u}, \qquad (6)$$

where $k_u$ is the hyperdegree of user u, $R_u$ is the number of resources that u collects and
$T_u$ is the number of tags that u possesses. This definition measures the fraction
of possible resource-tag pairs that are present in the neighborhood of u. A larger $C_u$ indicates that
the resources of u concern more similar topics, which might also show that u concentrates on
personalized or special topics, while a smaller $C_u$ might suggest that s/he has more diverse
interests. Similar definitions can be given for the clustering coefficients of
resources and tags.

An alternative metric, named hyperedge density, is proposed by Zlatić et al. [HI]. Taking
a user node u again as an example, they define the coordination number of u as
$z(u) = R_u + T_u$. Given $k(u)$, the maximal coordination number is $z_{\max}(u) = 2k(u)$, while the
minimal coordination number is $z_{\min}(u) = 2n$ for $n(n-1) < k(u) \le n^2$ and $z_{\min}(u) = 2n+1$
for $n^2 < k(u) \le n(n+1)$, with n some integer. Obviously, a local tree structure leads
to the maximal coordination number, while the maximum overlap corresponds to the minimal
coordination number. Therefore, they define the hyperedge density as

$$D_h(u) = \frac{z_{\max}(u) - z(u)}{z_{\max}(u) - z_{\min}(u)}, \qquad 0 \le D_h(u) \le 1. \qquad (7)$$

The definition of hyperedge density for resources and tags is similar. Empirical analysis
indicates a high clustering behavior under both metrics [5CT1 IM]. The study of hypergraphs
for collaborative tagging networks is still in its early stages, and how to properly quantify
the clustering behavior, the correlations and similarities between nodes, and the community
structure is still an open problem.

(iv) average distance: defined as the average shortest path length between two random 
nodes in the whole network. 
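
The quantities in (i) to (iii) can be computed directly from a list of tagging events. The following Python sketch is illustrative only. It assumes that the data is a list of (user, resource, tag) triples; the function names `min_coordination` and `user_hypergraph_stats` are hypothetical; and the density of a user with a single hyperedge, for which the maximal and minimal coordination numbers coincide and Eq. (7) is undefined, is returned as `None`.

```python
from collections import defaultdict


def min_coordination(k):
    """Minimal coordination number z_min for hyperdegree k >= 1."""
    n = 1
    while n * (n + 1) < k:  # find the smallest n with k <= n(n+1)
        n += 1
    # n(n-1) < k <= n^2 gives 2n; n^2 < k <= n(n+1) gives 2n+1
    return 2 * n if k <= n * n else 2 * n + 1


def user_hypergraph_stats(triples):
    """Map each user to (hyperdegree, clustering coefficient, hyperedge density)."""
    resources, tags, edges = defaultdict(set), defaultdict(set), defaultdict(set)
    for user, resource, tag in triples:
        resources[user].add(resource)
        tags[user].add(tag)
        edges[user].add((resource, tag))
    stats = {}
    for user, user_edges in edges.items():
        k = len(user_edges)
        n_res, n_tag = len(resources[user]), len(tags[user])
        clustering = k / (n_res * n_tag)  # Eq. (6)
        z, z_max, z_min = n_res + n_tag, 2 * k, min_coordination(k)
        # Eq. (7); undefined when z_max == z_min (single hyperedge)
        density = None if z_max == z_min else (z_max - z) / (z_max - z_min)
        stats[user] = (k, clustering, density)
    return stats


# Toy data (made up for illustration)
triples = [
    ("a", "r1", "t1"), ("a", "r1", "t2"), ("a", "r2", "t1"), ("a", "r2", "t2"),
    ("b", "r1", "t1"), ("b", "r2", "t2"), ("b", "r3", "t3"),
]
stats = user_hypergraph_stats(triples)
```

In this toy data, user a applies every one of its tags to every one of its resources (maximal overlap), whereas the hyperedges of user b share no resource or tag (a local tree structure), so the two users sit at opposite ends of both metrics.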

3.3. Recommender Systems 

A recommender system uses the input data to predict potential further likes and interests
of its users. Users' past evaluations are typically an important part of the input
data. Let M be the number of users and let N be the number of all objects that can
be evaluated and recommended. Note that object is simply a generic term which can
represent books, movies, or any other kind of consumed content. To stay in line with
standard terminology, we sometimes use item, which has the same meaning. To make the
notation clearer, we restrict ourselves to Latin indices i and j when enumerating the users and
to Greek indices $\alpha$ and $\beta$ when enumerating the objects. The evaluation (rating) of object $\alpha$
by user i is denoted as $r_{i\alpha}$. This evaluation is often numerical on an integer rating scale
(think of Amazon's five stars); in this case we speak of explicit ratings. Note that the
common case of binary ratings (like/dislike or good/bad) also belongs to this category.
When objects are only collected (as in bookmark sharing systems) or simply consumed
(as in online newspapers or magazines without rating systems) or when "like" is the only
possible expression (as on Facebook), we are left with unary ratings. In this case, $r_{i\alpha} = 1$
represents a collected/consumed/liked object and $r_{i\alpha} = 0$ represents a non-existing
evaluation (see Fig. 4). Inferring users' confidence levels of ratings is not a trivial task, especially
from binary or unary ratings. Auxiliary information about users' behavior may be
helpful: for example, the users' confidence levels can be estimated from their watching time
of television shows, and with the help of this information the quality of recommendation
can be improved [93]. Even if we have explicit ratings, it does not mean we know how and
why people vote with these ratings: do they have standards of numerical ratings, or do they
just use ratings to express orderings? Recent evidence [M] to some extent supports the latter
hypothesis.

Figure 4: Illustration of a recommender system consisting of five users and four books. The basic
information contained in every recommender system is the relations between users and objects, which can be
represented by a bipartite graph. This illustration also exhibits some additional information frequently
exploited in the design of recommendation algorithms, including user profiles, object attributes and object
content.
                          Alice   Bob   Carol
Titanic                     5      1      5
2001: A Space Odyssey       1      5      2
Casablanca                  4      2      ?

Table 2: Recommendation process in a nutshell: to estimate the potential favorable opinion of Carol about
Casablanca, one can use the similarity of her ratings to those of Alice. Alternatively, one can note that ratings
of Titanic and Casablanca follow a similar pattern, suggesting that people who liked the former might also
like the latter.



The goal of a recommender system is to deliver lists of personalized "recommended"
objects to its users. To this end, evaluations can be predicted or, alternatively, recommendation
scores can be assigned to objects yet unknown to a given user. Objects with the
highest predicted ratings or the highest recommendation scores then constitute the
recommendation list that is presented to the target user. There is an extensive set of performance
metrics that can be used to evaluate the resulting recommendation lists (see Sec. 3.4). The
usual classification of recommender systems is as follows [15]:

1. Content-based recommendations: Recommended objects are those with content sim- 
ilar to the content of previously preferred objects of a target user. We present them 
in Sec. 14331 

2. Collaborative recommendations: Recommended objects are selected on the basis of 
past evaluations of a large group of users. They can be divided into: 

(a) Memory-based collaborative filtering: Recommended objects are those that were
preferred by users whose preferences are similar to those of the target user, or those that
are similar to the other objects preferred by the target user. We present them
in Sec. 4 (standard similarity-based methods) and Sec. 7 (methods employing
social filtering).

(b) Model-based collaborative filtering: Recommended objects are selected using models
that are trained to identify patterns in the input data. We present them in
Sections 5 (dimensionality reduction methods) and 6 (diffusion-based methods).

3. Hybrid approaches: These methods combine collaborative with content-based meth- 
ods or with different variants of other collaborative methods. We present them in 
Sec. IO 

3.4. Evaluation Metrics for Recommendation

Given a target user i, a recommender system will sort all i's uncollected objects and 
recommend the top-ranked objects. To evaluate recommendation algorithms, the data is 
usually divided into two parts: the training set $E^T$ and the probe set $E^P$. The training
set is treated as known information, while no information from the probe set is allowed 
to be used for recommendation. In this section we briefly review basic metrics that are 
used to measure the quality of recommendations. How to choose a particular metric (or 
metrics) to evaluate recommendation performance depends on the goals that the system is 
supposed to fulfill. Of course, the ultimate evaluation of any recommender system is given 
by the judgement of its users. 
3.4.1. Accuracy Metrics

Rating Accuracy Metrics. The main purpose of recommender systems is to predict users'
future likes and interests. A multitude of metrics exist to measure various aspects of
recommendation performance. Two notable metrics, Mean Absolute Error (MAE) and
Root Mean Squared Error (RMSE), are used to measure the closeness of predicted ratings
to the true ratings. If $r_{i\alpha}$ is the true rating on object $\alpha$ by user i, $\tilde{r}_{i\alpha}$ is the predicted
rating and $E^P$ is the set of hidden user-object ratings, MAE and RMSE are defined as

$$\mathrm{MAE} = \frac{1}{|E^P|} \sum_{(i,\alpha) \in E^P} |r_{i\alpha} - \tilde{r}_{i\alpha}|, \qquad (8)$$

$$\mathrm{RMSE} = \left( \frac{1}{|E^P|} \sum_{(i,\alpha) \in E^P} (r_{i\alpha} - \tilde{r}_{i\alpha})^2 \right)^{1/2}. \qquad (9)$$



Lower MAE and RMSE correspond to higher prediction accuracy. Since RMSE squares the 
error before summing it, it tends to penalize large errors more heavily. As these metrics 
treat all ratings equally no matter what their positions are in the recommendation list, 
they are not optimal for some common tasks such as finding a small number of objects 
that are likely to be appreciated by a given user (Finding Good Objects). Yet, due to their 
simplicity, RMSE and MAE are widely used in the evaluation of recommender systems. 
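
As a minimal sketch of the two rating accuracy metrics, the following Python code evaluates MAE and RMSE over a small probe set. The dictionary layout (keys are (user, object) pairs) and the function name `mae_rmse` are assumptions made for illustration, and the ratings are made up.

```python
import math


def mae_rmse(true_ratings, predicted_ratings):
    """Return (MAE, RMSE) over a probe set given as {(user, object): rating}."""
    errors = [true_ratings[pair] - predicted_ratings[pair] for pair in true_ratings]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Hypothetical probe set and predictions
probe = {("i", "a"): 5, ("i", "b"): 3, ("j", "a"): 1}
predicted = {("i", "a"): 4, ("i", "b"): 3, ("j", "a"): 3}
mae, rmse = mae_rmse(probe, predicted)
```

In this example the single error of size 2 accounts for 2/3 of the summed absolute error but for 4/5 of the summed squared error, which illustrates how RMSE penalizes large errors more heavily.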

Rating and Ranking Correlations. Another way to evaluate the prediction accuracy is to
calculate the correlation between the predicted and the true ratings. There are three
well-known correlation measures, namely the Pearson product-moment correlation [95], the
Spearman correlation [HE] and Kendall's Tau [HTj. The Pearson correlation measures the
extent to which a linear relationship is present between the two sets of ratings. It is defined
as

$$\mathrm{PCC} = \frac{\sum_{\alpha} (\tilde{r}_\alpha - \bar{\tilde{r}})(r_\alpha - \bar{r})}{\sqrt{\sum_{\alpha} (\tilde{r}_\alpha - \bar{\tilde{r}})^2} \, \sqrt{\sum_{\alpha} (r_\alpha - \bar{r})^2}}, \qquad (10)$$

where $r_\alpha$ and $\tilde{r}_\alpha$ are the true and predicted ratings, respectively, and $\bar{r}$ and
$\bar{\tilde{r}}$ are their mean values. The Spearman correlation
coefficient $\rho$ is defined in the same manner as the Pearson correlation, except that $r_\alpha$ and $\tilde{r}_\alpha$
are replaced by the ranks of the respective objects. Similarly to the Spearman correlation,
Kendall's Tau measures the extent to which the two rankings agree, irrespective of the exact values
of ratings. It is defined as $\tau = (C - D)/(C + D)$, where C is the number of concordant
pairs (pairs of objects that the system predicts in the correct ranked order) and D is the
number of discordant pairs (pairs that the system predicts in the wrong order). $\tau = 1$
when the true and predicted rankings are identical and $\tau = -1$ when they are exactly
opposite. For the case when objects with equal true or predicted ratings exist, a variation
of Kendall's Tau was proposed in [13]:

$$\tau \approx \frac{C - D}{\sqrt{(C + D + S_T)(C + D + S_P)}}, \qquad (11)$$

where $S_T$ is the number of object pairs for which the true ratings are the same, and $S_P$
is the number of object pairs for which the predicted ratings are the same. Kendall's Tau
metric applies equal weight to any interchange of successively ranked objects, no matter
where it occurs. However, interchanges at different places, for example between ranks 1 and
2 and between ranks 100 and 101, may have different impacts. Thus a possible improved metric
could give more weight to object pairs at the top of the true ranking.
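
The three correlation measures can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions: ratings are given as two equally long lists (true and predicted), tied values receive the average of the ranks they span in `ranks`, and `kendall_tau` counts tied pairs as in Eq. (11), which reduces to $(C - D)/(C + D)$ when there are no ties. All function names are hypothetical, and degenerate inputs (constant lists) are not handled.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks with 1 for the largest value; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for i in order[start:end + 1]:
            result[i] = (start + end) / 2 + 1
        start = end + 1
    return result


def spearman(x, y):
    """Pearson correlation applied to the ranks of the two lists."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(true, pred):
    """Kendall's Tau with the tie correction of Eq. (11)."""
    c = d = s_true = s_pred = 0
    for i, j in combinations(range(len(true)), 2):
        dt, dp = true[i] - true[j], pred[i] - pred[j]
        s_true += dt == 0  # pair tied in the true ratings
        s_pred += dp == 0  # pair tied in the predicted ratings
        if dt * dp > 0:
            c += 1         # concordant pair
        elif dt * dp < 0:
            d += 1         # discordant pair
    return (c - d) / math.sqrt((c + d + s_true) * (c + d + s_pred))


# Hypothetical true and predicted ratings of four objects
true = [5, 3, 4, 1]
pred = [4, 3, 5, 2]
```

For these lists only the first and third objects are ordered incorrectly, so five of the six pairs are concordant and one is discordant.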

Similar to Kendall's Tau, the normalized distance-based performance measure (NDPM)
was originally proposed by Yao [98] to compare two different weakly ordered rankings. It
is based on counting the number of contradictory pairs $C^-$ (for which the two rankings
disagree) and compatible pairs $C^u$ (for which one ranking reports a tie while the other
reports strict preference of one object over the other). Denoting the total number of strict
preference relationships in the true ranking as C, NDPM is defined as

$$\mathrm{NDPM} = \frac{2C^- + C^u}{2C}. \qquad (12)$$

Since this metric does not punish the situation where the true ranks are tied, it is more
appropriate than correlation metrics for domains where users are interested in objects that
are good enough.
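
A possible implementation of NDPM is sketched below in Python. It assumes that both rankings are given as lists of ratings, and it interprets a compatible pair as one where the true ratings express a strict preference while the predicted ratings are tied; pairs tied in the true ratings are skipped, in line with the remark that such ties are not punished. The function name `ndpm` is hypothetical.

```python
from itertools import combinations


def ndpm(true, pred):
    """Normalized distance-based performance measure (0 is best, 1 is worst)."""
    contradictory = compatible = strict = 0
    for i, j in combinations(range(len(true)), 2):
        dt, dp = true[i] - true[j], pred[i] - pred[j]
        if dt == 0:
            continue           # tie in the true ranking: not penalized
        strict += 1            # strict preference in the true ranking
        if dp == 0:
            compatible += 1    # prediction reports a tie
        elif dt * dp < 0:
            contradictory += 1 # prediction reverses the true order
    return (2 * contradictory + compatible) / (2 * strict)


# Hypothetical true and predicted ratings of four objects
true = [5, 3, 4, 1]
pred = [4, 3, 5, 2]
```

With this convention a prediction that ties all objects scores 0.5, halfway between a perfect ranking (0) and a fully reversed one (1).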

Classification Accuracy Metrics. Classification metrics are appropriate for tasks such as 
"Finding Good Objects", especially when only implicit ratings are available (i.e., we know 
which objects were favored by a user but not how much they were favored). When a 
ranked list of objects is given, the threshold for recommendations is ambiguous or variable. 
To evaluate systems of this kind, one popular metric is AUC (Area Under ROC Curve),
where ROC stands for the receiver operating characteristic [99] (for how to draw a ROC
curve see [13]). AUC attempts to measure how well a recommender system can
distinguish the relevant objects (those appreciated by a user) from the irrelevant objects (all
the others). The simplest way to calculate AUC is by comparing the probability that the
relevant objects will be recommended with that of the irrelevant objects. For n independent
comparisons (each comparison refers to choosing one relevant and one irrelevant object),
if there are n' times when the relevant object has a higher score than the irrelevant one and n''
times when the scores are equal, then according to [100]

$$\mathrm{AUC} = \frac{n' + 0.5\, n''}{n}. \qquad (13)$$

Clearly, if all relevant objects have higher scores than irrelevant objects, AUC = 1, which
means a perfect recommendation list. For a randomly ranked recommendation list, AUC =
0.5. Therefore, the degree to which AUC exceeds 0.5 indicates the ability of a
recommendation algorithm to identify relevant objects. Similar to AUC is the so-called Ranking Score
proposed in [101]. For a given user, we measure the relative ranking of a relevant object in
this user's recommendation list: when there are o objects to be recommended, a relevant 
object with ranking r has the relative ranking r/o. By averaging over all users and their 
relevant objects, we obtain the mean ranking score RS — the smaller the ranking score, the 
higher the algorithm's accuracy, and vice versa. 
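
Both quantities are easy to compute for a single user. The Python sketch below is illustrative: instead of sampling n independent comparisons it enumerates all relevant-irrelevant pairs, which gives the value that the sampled estimate approaches; the function names and the toy scores are assumptions.

```python
def auc(scores, relevant):
    """AUC for one user from {object: score} and the set of relevant objects."""
    rel = [s for obj, s in scores.items() if obj in relevant]
    irr = [s for obj, s in scores.items() if obj not in relevant]
    higher = sum(1 for a in rel for b in irr if a > b)  # n'
    equal = sum(1 for a in rel for b in irr if a == b)  # n''
    return (higher + 0.5 * equal) / (len(rel) * len(irr))


def ranking_score(scores, relevant):
    """Mean relative rank r/o of the relevant objects (smaller is better)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    position = {obj: rank for rank, obj in enumerate(ordered, start=1)}
    return sum(position[obj] for obj in relevant) / (len(relevant) * len(ordered))


# Hypothetical recommendation scores for one user
scores = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.1}
relevant = {"a", "c"}
```

Here three of the four relevant-irrelevant pairs are ordered correctly, and the relevant objects occupy ranks 1 and 3 out of 4.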
Since real users are usually concerned only with the top part of the recommendation
list, a more practical approach is to consider the number of a user's relevant objects ranked
in the top-L places. Precision and recall are the most popular metrics based on this. For
a target user i, precision and recall of recommendation, $P_i(L)$ and $R_i(L)$, are defined as

$$P_i(L) = \frac{d_i(L)}{L}, \qquad R_i(L) = \frac{d_i(L)}{D_i}, \qquad (14)$$

where $d_i(L)$ indicates the number of relevant objects (objects collected by i that are present
in the probe set) in the top-L places of the recommendation list, and $D_i$ is the total number
of i's relevant objects. Averaging the individual precision and recall over all users with
at least one relevant object, we obtain the mean precision and recall, P(L) and R(L),
respectively. These values can be compared with precision and recall resulting from random
recommendation, leading to precision and recall enhancements as defined in [37]

$$e_P(L) = P(L) \frac{MN}{D}, \qquad e_R(L) = R(L) \frac{N}{L}, \qquad (15)$$

where M and N are the number of users and objects, respectively, and D is the total
number of relevant objects. While precision usually decreases with L, recall always grows
with L. One may combine them into a less L-dependent metric [102, 103]

$$F_1(L) = \frac{2PR}{P + R}, \qquad (16)$$

which is called the $F_1$-score. Many other measures which combine precision and recall are
used to evaluate the effectiveness of information retrieval, but are rarely applied to evaluate
recommendation algorithms: Average Precision, Precision-at-Depth, R-Precision,
Reciprocal Rank [104], and Binary Preference Measure [105]. A detailed introduction and discussion
of each combination index can be found in [106].
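
For a single user, precision, recall and their harmonic combination can be sketched as follows. The function name `precision_recall_f1` and the toy data are assumptions; the $F_1$ value is set to zero when no relevant object appears in the top-L places, a convention chosen here to avoid division by zero.

```python
def precision_recall_f1(ranked, relevant, L):
    """Precision, recall and F1 of the top-L part of one recommendation list."""
    hits = sum(1 for obj in ranked[:L] if obj in relevant)  # d_i(L)
    precision = hits / L
    recall = hits / len(relevant)                           # D_i = len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical ranked list and probe-set objects of one user
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}
p, r, f1 = precision_recall_f1(ranked, relevant, L=2)
```

With L = 2 only object a is hit, giving precision 1/2, recall 1/3 and an $F_1$-score of 0.4.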

3.4.2. Rank-weighted Indexes

Since users have limited patience for inspecting individual objects in the recommended
lists, user satisfaction is best measured by taking into account the position of each relevant
object and assigning weights to them accordingly. Here we introduce three representative
indexes that follow this approach. For a detailed discussion of their strengths and weaknesses
see [13].

Half-life Utility. The half-life utility metric attempts to evaluate the utility of a
recommendation list to a user. It is based on the assumption that the likelihood that a user examines
a recommended object decays exponentially with the object's ranking. The expected utility
of recommendations given to user i hence becomes [107]

$$HL_i = \sum_{\alpha=1}^{N} \frac{\max(r_{i\alpha} - d, 0)}{2^{(o_{i\alpha} - 1)/(h - 1)}}, \qquad (17)$$

where objects are sorted by their recommendation score $\tilde{r}_{i\alpha}$ in descending order, $o_{i\alpha}$
represents the predicted ranking of object $\alpha$ in the recommendation list of user i, d is the
default rating (for example, the average rating), and the "half-life" h is the rank of the
object on the list for which there is a 50% chance that the user will eventually examine
it. This utility can be further normalized by the maximum utility (which is achieved when
all of the user's known ratings appear at the top of the recommendation list). When $HL_i$ is
averaged over all users, we obtain an overall utility of the whole system.
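
The half-life utility of one user can be sketched as below. The code assumes that the user's true ratings are supplied in the order of the predicted ranking (position 1 first) and that h > 1; the function name `half_life_utility` and the numbers are illustrative.

```python
def half_life_utility(ratings_by_rank, d, h):
    """Half-life utility for true ratings listed in order of predicted rank."""
    return sum(
        max(r - d, 0) / 2 ** ((rank - 1) / (h - 1))
        for rank, r in enumerate(ratings_by_rank, start=1)
    )


# Hypothetical true ratings of the objects ranked 1st, 2nd and 3rd
utility = half_life_utility([5, 3, 4], d=3, h=2)
```

With default rating d = 3 and half-life h = 2, the first object contributes 2, the second contributes nothing, and the third contributes 1/4 because its weight has been halved twice.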

Discounted Cumulative Gain. For a recommendation list of length L, DCG is defined as

$$\mathrm{DCG}(b) = \sum_{n=1}^{b} r_n + \sum_{n=b+1}^{L} \frac{r_n}{\log_b n}, \qquad (18)$$

where $r_n$ indicates the relevance of the n-th ranked object ($r_n = 1$ for a relevant object
and zero otherwise) and b is a persistence parameter which was suggested to be 2. The
intention of DCG is that highly ranked relevant objects give more satisfaction and utility
than badly ranked ones.
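
A short Python sketch of DCG follows, assuming that the first b positions are not discounted and that later positions are divided by the base-b logarithm of their rank. The function name `dcg` and the relevance list are illustrative.

```python
import math


def dcg(relevance, b=2):
    """Discounted cumulative gain of a list of 0/1 relevance values."""
    total = 0.0
    for n, r in enumerate(relevance, start=1):
        # ranks up to b count fully; later ranks are discounted by log_b(n)
        total += r if n <= b else r / (math.log(n) / math.log(b))
    return total


# Hypothetical relevance of the objects ranked 1 to 4
gain = dcg([1, 0, 1, 1], b=2)
```

The relevant objects at ranks 3 and 4 contribute $1/\log_2 3$ and $1/\log_2 4 = 0.5$, respectively, compared with a full contribution of 1 at rank 1.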

Rank-biased Precision. This metric assumes that users always check the first object and
progress from one object to the next with a certain (persistence) probability p (with the
complementary probability 1 - p, the examination of the recommendation list ends). For
a list of length L, the rank-biased precision metric is defined as [106]

$$\mathrm{RBP} = (1 - p) \sum_{n=1}^{L} r_n p^{n-1}, \qquad (19)$$

where $r_n$ is the same as in DCG. RBP is similar to DCG; the difference is that RBP
discounts relevance via a geometric sequence, while DCG does so using a log-harmonic
form.
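
The geometric discount of RBP can be written in one line. The sketch below uses the same 0/1 relevance convention as DCG; the function name `rbp` and the input are illustrative.

```python
def rbp(relevance, p):
    """Rank-biased precision of a list of 0/1 relevance values."""
    return (1 - p) * sum(r * p ** (n - 1) for n, r in enumerate(relevance, start=1))


# Hypothetical relevance of the objects ranked 1 to 3, persistence p = 0.5
value = rbp([1, 0, 1], p=0.5)
```

With p = 0.5 the relevant object at rank 3 is weighted by 0.25, so the result is 0.5 × (1 + 0.25) = 0.625.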

3.4.3. Diversity and Novelty

Even a successfully recommended relevant object has little value to a user when it
is already widely known. To complement the above accuracy-probing metrics, several diversity- and
novelty-probing metrics have been proposed recently [33, EH, 109] and we introduce them
here.

Diversity. Diversity in recommender systems refers to how different the recommended 
objects are with respect to each other. There are two levels to interpret diversity: one 
refers to the ability of an algorithm to return different results to different users — we call 
it Inter-user diversity (i.e., the diversity between recommendation lists). The other one 
measures the extent to which an algorithm can provide diverse objects to each individual 
user — we call it Intra-user diversity (i.e., the diversity within a recommendation list). 
Inter-user diversity [110] is defined by considering the variety of users' recommendation 
lists. Given users i and j, the difference between the top L places of their recommendation
lists can be measured by the Hamming distance

$$H_{ij}(L) = 1 - \frac{Q_{ij}(L)}{L}, \qquad (20)$$

where $Q_{ij}(L)$ is the number of common objects in the top-L places of the lists of users i
and j. If the lists are identical, $H_{ij}(L) = 0$, while if their lists are completely different,
$H_{ij}(L) = 1$. Averaging $H_{ij}(L)$ over all user pairs, we obtain the mean Hamming distance
H(L). The greater its value, the more diverse (more personalized) recommendation is
given to the users.

Denoting the recommended objects for user i as $\{o_1, o_2, \dots, o_L\}$, the similarity of these
objects $s(o_\alpha, o_\beta)$ can be used to measure the intra-user diversity (this similarity can be
obtained either directly from the input ratings or from object metadata). The average
similarity of objects recommended to user i,

$$I_i(L) = \frac{1}{L(L-1)} \sum_{\alpha \neq \beta} s(o_\alpha, o_\beta), \qquad (21)$$

can be further averaged over all users to obtain the mean intra-similarity of the
recommendation lists, I(L). The lower this quantity, the more diverse the objects recommended
to the users. Notably, intra-list diversity can be used to improve recommendation
lists by avoiding recommendation of excessively similar objects [35]. A rank-sensitive
version can be obtained by introducing a discount function of the object's rank in the
recommendation list [109].
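
Both diversity measures can be sketched in Python as follows. The code assumes that recommendation lists are stored as `{user: ranked list}` and that the object similarity is a symmetric function supplied by the caller; the toy similarity used here (1 if two objects share a genre, 0 otherwise) and all names are hypothetical.

```python
from itertools import combinations


def hamming_distance(list_i, list_j, L):
    """1 minus the fraction of common objects in the two top-L lists."""
    common = len(set(list_i[:L]) & set(list_j[:L]))
    return 1 - common / L


def mean_hamming(rec_lists, L):
    """Hamming distance averaged over all pairs of users."""
    pairs = list(combinations(rec_lists.values(), 2))
    return sum(hamming_distance(a, b, L) for a, b in pairs) / len(pairs)


def intra_similarity(rec_list, L, sim):
    """Average similarity over pairs of distinct objects in one top-L list."""
    top = rec_list[:L]
    total = sum(sim(a, b) for a, b in combinations(top, 2))
    # each unordered pair stands for two ordered pairs of a symmetric similarity
    return 2 * total / (len(top) * (len(top) - 1))


# Hypothetical recommendation lists and object metadata
rec_lists = {"i": ["a", "b", "c"], "j": ["a", "d", "e"], "k": ["a", "b", "c"]}
genre = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "z"}


def same_genre(o1, o2):
    return 1.0 if genre[o1] == genre[o2] else 0.0
```

In this example users i and k receive identical lists (distance 0), while each of them shares one of three objects with user j (distance 2/3).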

Novelty and Surprisal. Novelty in recommender systems refers to how different the
recommended objects are with respect to what the users have already seen. The
simplest way to quantify the ability of an algorithm to generate novel and unexpected
results is to measure the average popularity of the recommended objects,

$$N(L) = \frac{1}{ML} \sum_{i=1}^{M} \sum_{\alpha \in O_R^i} k_\alpha, \qquad (22)$$

where $O_R^i$ is the recommendation list of user i and $k_\alpha$ denotes the degree of object $\alpha$
(i.e., the popularity of object $\alpha$). Lower popularity indicates higher novelty of the results.
Another possibility to measure unexpectedness is to use the self-information (surprisal)
[112] of recommended objects. Given an object $\alpha$, the chance that a randomly selected
user has collected it is $k_\alpha / M$ and thus its self-information is

$$U_\alpha = \log_2 (M / k_\alpha). \qquad (23)$$

A user-relative novelty variant can be defined by restricting the observations to the target
user, namely calculating the mean self-information of the target user's top-L objects. Averaging
over all users we obtain the mean top-L surprisal U(L). With a similar resulting formula,
a discovery-based novelty was proposed in [109] by considering the probability that an
object is known or familiar to a random user.
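
The two novelty measures can be sketched as below. The code assumes that every user of the system has a recommendation list, so that M equals the number of lists, and that object degrees are given as a dictionary; the function names and the data are illustrative.

```python
import math


def mean_popularity(rec_lists, degree, L):
    """Average degree of the objects in all top-L lists, N(L)."""
    M = len(rec_lists)
    total = sum(degree[obj] for lst in rec_lists.values() for obj in lst[:L])
    return total / (M * L)


def mean_surprisal(rec_lists, degree, L):
    """Mean self-information of the top-L objects, averaged over users, U(L)."""
    M = len(rec_lists)
    per_user = [
        sum(math.log2(M / degree[obj]) for obj in lst[:L]) / L
        for lst in rec_lists.values()
    ]
    return sum(per_user) / len(per_user)


# Hypothetical system with two users and three objects
rec_lists = {"i": ["a", "b"], "j": ["a", "c"]}
degree = {"a": 2, "b": 1, "c": 1}
```

Object a has been collected by both users and therefore carries zero self-information, while b and c, each collected by one of the two users, carry one bit.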
3.4.4. Coverage

Coverage measures the percentage of objects that an algorithm is able to recommend
to users in the system. Denoting the total number of distinct objects in the top L places of all
recommendation lists as $N_d$, the L-dependent coverage is defined as

$$\mathrm{COV}(L) = N_d / N. \qquad (24)$$

Low coverage indicates that the algorithm can access and recommend only a small number
of distinct objects (usually the most popular ones), which often results in recommendations
of low diversity. By contrast, algorithms with high coverage are more likely to provide
diverse recommendations [113]. From this viewpoint, coverage can also be considered
a diversity metric. In addition, coverage helps to better evaluate the results of accuracy
metrics [114]: recommending popular objects is likely to be of high accuracy but of low
coverage. A good recommendation method is expected to have both high accuracy and high
coverage.
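
Coverage reduces to counting distinct objects, as in the following sketch. The function name `coverage` and the toy data are assumptions.

```python
def coverage(rec_lists, n_objects, L):
    """Fraction of all objects that appear in at least one top-L list."""
    distinct = set()
    for lst in rec_lists.values():
        distinct.update(lst[:L])
    return len(distinct) / n_objects


# Hypothetical system with two users and N = 4 objects
rec_lists = {"i": ["a", "b"], "j": ["a", "c"]}
cov = coverage(rec_lists, n_objects=4, L=2)
```

Three of the four objects are recommended to at least one user, so the coverage is 0.75; restricting the lists to L = 1 leaves only object a and a coverage of 0.25.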

The choice of a particular metric (or metrics) to evaluate a recommender system de- 
pends on the goals that the system is supposed to fulfill. In practice, one may specify 
different goals for new and experienced users, which further complicates the evaluation
process. For a better overview, Table 3 summarizes the described metrics for the evaluation of
recommender systems.

4. Similarity-based methods 

Similarity-based methods represent one of the most successful approaches to
recommendation. They have been studied extensively and found various applications in e-commerce
[115, 116]. This class of algorithms can be further divided into methods employing user
and item similarity, respectively. The basic assumption of a method based on user
similarity is that people who agree in their past evaluations tend to agree again in their
future evaluations. Thus, for a target user, the potential evaluation of an object is
estimated according to the ratings from users ("taste mates") who are similar to the target
user (see Fig. 5 for a schematic illustration). In contrast, an algorithm
based on item similarity recommends to a user the objects that are similar to what this user
has collected before. Note that sometimes the opinions of dissimilar users [117] or
negative ratings [118, 119] can play a significant (even positive) role in determining the
recommendation, especially when the data set is very sparse and thus the information about
relevance is more important than that about correlation [120]. For additional information
see the recent review articles [121, 122]; [123] is a survey that covers a number
of similarity indices.

4.1. Algorithms

Here we briefly introduce the conventional similarity-based algorithms which are often
referred to as memory-based collaborative filtering techniques. The term "collaborative
filtering" was introduced by the creators of the first commercial recommender system, Tapestry
[124], and derives from the fact that it requires the collaboration of multiple agents who share
their data to obtain better recommendations. In the following sections, we describe basic
algorithms as well as the main approaches to the computation of similarity, which is a critical
component of the recommendation process.

Table 3: Summary of the presented recommendation metrics. The third column represents the preference
of the metric (e.g., smaller MAE means higher rating accuracy). The fourth column describes the scope
of the metric. The last two columns show whether the metric is obtained from a ranking and whether it
depends on the length of the recommendation list L.

Name                         Symbol      Preference   Scope                        Rank   L
MAE                          MAE         small        rating accuracy              No     No
RMSE                         RMSE        small        rating accuracy              No     No
Pearson                      PCC         large        rating correlation           No     No
Spearman                     $\rho$      large        rating correlation           Yes    No
Kendall's Tau                $\tau$      large        rating correlation           Yes    No
NDPM                         NDPM        small        ranking correlation          Yes    No
Precision                    P(L)        large        classification accuracy      No     Yes
Recall                       R(L)        large        classification accuracy      No     Yes
F1-score                     $F_1(L)$    large        classification accuracy      No     Yes
AUC                          AUC         large        classification accuracy      No     No
Ranking score                RS          small        ranking accuracy             Yes    No
Half-life utility            HL(L)       large        satisfaction                 Yes    Yes
Discounted Cumulative Gain   DCG(b, L)   large        satisfaction and precision   Yes    Yes
Rank-biased Precision        RBP(p, L)   large        satisfaction and precision   Yes    Yes
Hamming distance             H(L)        large        inter-diversity              No     Yes
Intra-similarity             I(L)        small        intra-diversity              No     Yes
Popularity                   N(L)        small        surprisal and novelty        No     Yes
Self-information             U(L)        large        unexpectedness               No     Yes
Coverage                     COV(L)      large        coverage and diversity       No     Yes

Figure 5: A schematic representation of the collaborative filtering (CF) recommendation method: rating
prediction for a given user-object pair is based on the user's and object's past ratings.

4.1.1. User similarity

The goal is to make automated predictions of user preferences by collecting evaluation
data from many other users, especially those whose evaluations are similar to those
of the target user. Denote the rating from user u on object $\alpha$ as $r_{u\alpha}$ and let $\Gamma_u$ be the set
of objects that user u has evaluated. The average rating given by u is
$\bar{r}_u = \frac{1}{|\Gamma_u|} \sum_{\alpha \in \Gamma_u} r_{u\alpha}$.
According to standard collaborative filtering, the predicted rating of user u on object
$\alpha$ is

$$\tilde{r}_{u\alpha} = \bar{r}_u + \kappa \sum_{v \in U_u} s_{uv} (r_{v\alpha} - \bar{r}_v), \qquad (25)$$

where $U_u$ denotes the set of users that are most similar to user u, $s_{uv}$ denotes the similarity
between user u and user v, and $\kappa = 1 / \sum_{v \in U_u} |s_{uv}|$ is a normalization factor. If instead of explicit
ratings, only the sets of objects collected by individual users are known (implicit ratings),
we aim at predicting the objects which are most likely to be collected by a user in the
future. According to [117], Eq. (25) should be replaced with

$$p_{u\alpha} = \sum_{v \in U_u} s_{uv} a_{v\alpha}, \qquad (26)$$

where $p_{u\alpha}$ is the recommendation score of object $\alpha$ for user u and $a_{v\alpha}$ is an element of
the adjacency matrix of the user-object bipartite network ($a_{v\alpha} = 1$ if user v has collected
object $\alpha$ and $a_{v\alpha} = 0$ otherwise).

As has already been made explicit by Eq. (25) and Eq. (26), only the users most similar to
a given user u are usually considered. To obtain $U_u$, two neighborhood selection strategies
are usually applied: (i) correlation threshold [125] is based on selecting all users v whose
similarity $s_{uv}$ surpasses a given threshold; (ii) maximum number of neighbors [126] consists
of selecting those k users that are most similar to u (here k is a parameter of the algorithm).
Restricting computation to the most similar users is not only computationally advantageous
but in general also leads to superior results [127].
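
A user-based prediction with the maximum-number-of-neighbors strategy can be sketched as follows. This is an illustration under several assumptions: ratings are stored as `{user: {object: rating}}`, the similarity is supplied by the caller as a function, only users who have rated the target object are eligible as neighbors, and the offsets from the neighbors' mean ratings are normalized by the sum of absolute similarities. The function names and the similarity values are hypothetical; the ratings are those of Table 2.

```python
def mean_rating(ratings, u):
    """Average of all ratings given by user u."""
    return sum(ratings[u].values()) / len(ratings[u])


def predict_rating(ratings, sim, u, obj, k=2):
    """Predict u's rating of obj from the k most similar users who rated obj."""
    raters = [v for v in ratings if v != u and obj in ratings[v]]
    neighbours = sorted(raters, key=lambda v: sim(u, v), reverse=True)[:k]
    norm = sum(abs(sim(u, v)) for v in neighbours)
    if norm == 0:
        return mean_rating(ratings, u)  # no usable neighbour: fall back to the mean
    offset = sum(
        sim(u, v) * (ratings[v][obj] - mean_rating(ratings, v)) for v in neighbours
    )
    return mean_rating(ratings, u) + offset / norm


ratings = {
    "Alice": {"Titanic": 5, "2001": 1, "Casablanca": 4},
    "Bob": {"Titanic": 1, "2001": 5, "Casablanca": 2},
    "Carol": {"Titanic": 5, "2001": 2},
}
# Made-up similarity values, not computed from the ratings
similarity = {("Carol", "Alice"): 0.9, ("Carol", "Bob"): -0.8}


def sim(u, v):
    return similarity[(u, v)]


prediction = predict_rating(ratings, sim, "Carol", "Casablanca", k=1)
```

With Alice as the only neighbor, Carol's predicted rating of Casablanca is her own mean (3.5) plus Alice's deviation from her mean (4 - 10/3), that is about 4.17.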

4.1.2. Item similarity

In this case, item-item similarity $s_{\alpha\beta}$ is employed instead of user-user similarity $s_{uv}$.
The simplest way is to estimate unknown ratings using the weighted average

$$\tilde{r}_{u\alpha} = \frac{\sum_{\beta \in \Gamma_u} s_{\alpha\beta} r_{u\beta}}{\sum_{\beta \in \Gamma_u} |s_{\alpha\beta}|}, \qquad (27)$$

where $\Gamma_u$ is the set of items evaluated by user u. Techniques limiting the computation of
$\tilde{r}_{u\alpha}$ to items that are most similar to $\alpha$ can be applied in the same way as described above for user
similarity. One of the advantages of this approach is that similarity between items tends to
be more static than similarity between users, allowing its values and neighborhoods to be
computed offline (i.e., before a recommendation for a particular user is requested), which
shortens the time needed to obtain the recommendation. Hybrid collaborative filtering
algorithms combining user-, item- or attribute-based similarity were proposed in [129, 130].
Their results show that this approach not only improves the prediction accuracy but
is also more robust to data sparsity.
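
The weighted average over a user's rated items can be sketched in a few lines. The code assumes that the item-item similarity is supplied as a function and normalizes by the sum of absolute similarities; the function name `predict_item_based` and the similarity values are hypothetical.

```python
def predict_item_based(user_ratings, sim, target):
    """Similarity-weighted average of the user's ratings of other items."""
    num = sum(sim(target, item) * r for item, r in user_ratings.items())
    den = sum(abs(sim(target, item)) for item in user_ratings)
    return num / den if den else None  # None when no similar item is rated


# Carol's ratings from Table 2 and made-up item similarities
carol = {"Titanic": 5, "2001": 2}
item_similarity = {("Casablanca", "Titanic"): 0.8, ("Casablanca", "2001"): 0.2}


def sim(a, b):
    return item_similarity[(a, b)]


prediction = predict_item_based(carol, sim, "Casablanca")
```

Because Casablanca is assumed to be much more similar to Titanic than to 2001: A Space Odyssey, the prediction (0.8 × 5 + 0.2 × 2 = 4.4) lies close to Carol's rating of Titanic.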

4.1.3. Slope One predictor

The Slope One predictor, with the form $f(x) = x + b$, where b is a constant and x is a
variable representing the rating values, is the simplest form of item-based collaborative
filtering based on ratings [131]. It subtracts the average ratings of two items to measure
how much more, on average, one item is liked than another. This difference is used to
predict another user's rating of one of these two items, given his or her rating of the other. For
example, consider a case where user i gave score 1 to item $\alpha$ and score 1.5 to item $\beta$, while
user j gave score 2 to item $\alpha$. Slope One then predicts that user j will rate item $\beta$ with
$2 + (1.5 - 1) = 2.5$ (see Fig. 6 for an illustration).

Figure 6: In the depicted case, the Slope One prediction would be $x = 2 + (1.5 - 1) = 2.5$.



The Slope One scheme takes into account both information from other users who rated
the same item and from the other items rated by the same user. In particular, only ratings
by users who have rated some common items with the target user and only ratings of items
that the target user has also rated are involved in the prediction process. Denoting the set
of users who rated both items $\alpha$ and $\beta$ as $S(\alpha,\beta)$, the average deviation of item $\beta$ with respect
to item $\alpha$ is defined as

$$\mathrm{dev}_{\alpha\beta} = \frac{\sum_{u \in S(\alpha,\beta)} (r_{u\beta} - r_{u\alpha})}{|S(\alpha,\beta)|}. \tag{28}$$

Given a known rating $r_{u\alpha}$, Slope One predicts $u$'s rating of item $\beta$ as $r_{u\alpha} + \mathrm{dev}_{\alpha\beta}$. By
varying $\alpha$ in Eq. 28, we obtain different predictions. A reasonable overall predictor is their
average value

$$\tilde{r}_{u\beta} = \frac{1}{|R(u,\beta)|} \sum_{\alpha \in R(u,\beta)} (r_{u\alpha} + \mathrm{dev}_{\alpha\beta}), \tag{29}$$

where $R(u,\beta)$ is the set of items that have been both rated by $u$ and co-rated with item
$\beta$. Note that predictions obtained from different items $\alpha$ have equal weight no matter how
many users have co-rated $\alpha$ with $\beta$. To take into account the fact that the credibility of
$\mathrm{dev}_{\alpha\beta}$ depends crucially on $|S(\alpha,\beta)|$ (the larger the overlap, the more trustworthy the value),
one can introduce a Weighted Slope One prediction as

$$\tilde{r}^{\,w}_{u\beta} = \frac{\sum_{\alpha \in R(u,\beta)} |S(\alpha,\beta)|\,(r_{u\alpha} + \mathrm{dev}_{\alpha\beta})}{\sum_{\alpha \in R(u,\beta)} |S(\alpha,\beta)|}. \tag{30}$$

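Another improvement of the basic Slope One algorithm is based on dividing the set of all
items into items liked and disliked by a given user (a straightforward criterion to identify
liked and disliked items is to check whether their ratings are higher or lower than the
average rating awarded by the given user). From these liked and disliked items, two
separate predictions are then derived which are combined into one prediction at the very
end. Denote by $S^{+1}(\alpha,\beta)$ and $S^{-1}(\alpha,\beta)$ the sets of users who like and dislike, respectively,
both $\alpha$ and $\beta$. The deviations for liked and disliked items are

$$\mathrm{dev}^{+1}_{\alpha\beta} = \frac{\sum_{u \in S^{+1}(\alpha,\beta)} (r_{u\beta} - r_{u\alpha})}{|S^{+1}(\alpha,\beta)|}, \qquad \mathrm{dev}^{-1}_{\alpha\beta} = \frac{\sum_{u \in S^{-1}(\alpha,\beta)} (r_{u\beta} - r_{u\alpha})}{|S^{-1}(\alpha,\beta)|}. \tag{31}$$

The prediction for the rating of item $\beta$ based on the rating of item $\alpha$ is either $r_{j\alpha} + \mathrm{dev}^{+1}_{\alpha\beta}$
or $r_{j\alpha} + \mathrm{dev}^{-1}_{\alpha\beta}$ depending on whether the target user $j$ likes or dislikes item $\alpha$, respectively.
The Bi-Polar Slope One is thus given by

$$\tilde{r}^{\,\mathrm{bi}}_{j\beta} = \frac{\sum_{\alpha} |S^{+1}(\alpha,\beta)|\,(r_{j\alpha} + \mathrm{dev}^{+1}_{\alpha\beta}) + \sum_{\alpha} |S^{-1}(\alpha,\beta)|\,(r_{j\alpha} + \mathrm{dev}^{-1}_{\alpha\beta})}{\sum_{\alpha} |S^{+1}(\alpha,\beta)| + \sum_{\alpha} |S^{-1}(\alpha,\beta)|}, \tag{32}$$

where the first sums run over the items $\alpha$ liked by user $j$, the second sums run over the items
disliked by user $j$, and the weights are chosen similarly as for Weighted Slope One.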

It was shown that Slope One can outperform linear regression (i.e., estimation by
$f(x) = ax + b$) while having half the number of regressors [131, 114]. This simple approach
also reduces storage requirements and latency of the recommender system. Slope One has
been used as a building block to improve other algorithms [132, 133, 134]. For instance,
it can be combined with user-based collaborative filtering to address the data sparsity
problem by filling the vacant ratings of the user-item matrix using the Slope One scheme,
thus improving the prediction accuracy [132].
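As a minimal sketch of the Weighted Slope One scheme described above, the following Python code computes the pairwise average deviations and combines them with weights equal to the number of co-rating users. The function names and the nested-dictionary data layout are assumptions made for this illustration, not part of the original algorithm description.

```python
from collections import defaultdict


def slope_one_deviations(ratings):
    """Return (dev, count): dev[(a, b)] is the mean of r_ub - r_ua over
    the users who rated both a and b, count[(a, b)] is their number."""
    diff_sum = defaultdict(float)
    count = defaultdict(int)
    for user_ratings in ratings.values():
        for a, r_a in user_ratings.items():
            for b, r_b in user_ratings.items():
                if a != b:
                    diff_sum[(a, b)] += r_b - r_a
                    count[(a, b)] += 1
    dev = {pair: diff_sum[pair] / count[pair] for pair in count}
    return dev, dict(count)


def predict_weighted_slope_one(ratings, user, target):
    """Weighted Slope One prediction of `user`'s rating of `target`."""
    dev, count = slope_one_deviations(ratings)
    numerator = 0.0
    denominator = 0
    for a, r_a in ratings[user].items():
        if a != target and (a, target) in dev:
            numerator += count[(a, target)] * (r_a + dev[(a, target)])
            denominator += count[(a, target)]
    return numerator / denominator if denominator else None


# The example from the text: i rated alpha=1 and beta=1.5, j rated alpha=2.
ratings = {"i": {"alpha": 1.0, "beta": 1.5}, "j": {"alpha": 2.0}}
print(predict_weighted_slope_one(ratings, "j", "beta"))  # 2.5
```

In a production setting the deviations would be precomputed once rather than on every call; they are recomputed here only to keep the sketch short.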

4.2. How to define similarity

The key problem of similarity-based algorithms is how to define similarity between 
users or objects. When explicit ratings are available, similarity is usually defined using
a correlation metric such as the Pearson coefficient (two users are considered similar
when they tend to give similar ratings to the objects they rate). When there is no rating
information available, similarity can be inferred from the structural properties of the input
data (two users are considered similar when they liked/bought many objects in common).
In addition, external information such as users' attributes, tags and objects' content meta
information can be utilized to estimate similarity better.

4.2.1. Rating-based similarity

In many online e-commerce services, users are allowed to evaluate the consumed objects
by ratings. For example, in Yahoo Music, users rate each song with one to five stars
representing "Never play again" (*), "It is ok" (**), "Like it" (***), "Love it" (****)
and "Can't get enough" (*****). With explicit rating information we can measure the
similarity between two users or between two objects by the cosine index [15, 135], which is
defined as

$$s^{\cos}_{xy} = \frac{\mathbf{r}_x \cdot \mathbf{r}_y}{|\mathbf{r}_x|\,|\mathbf{r}_y|}. \tag{33}$$

For quantifying the similarity between users, $\mathbf{r}_x$, $\mathbf{r}_y$ are rating vectors in the $N$-dimensional
object space while for similarity between objects, $\mathbf{r}_x$, $\mathbf{r}_y$ are vectors in the $M$-dimensional user
space. Note that, in the calculation of rating-based similarity, it is necessary to eliminate
the rating tendencies of users and/or items, otherwise the similarity is less meaningful.
In fact, according to a recently reported method, in some rating systems proper use
of rating tendencies allows one to predict the unknown ratings with remarkably
higher accuracy than simple similarity-based methods [114].

The rating correlation can also be measured by the Pearson coefficient (PC) [15, 135]. To
quantify similarity between users $u$ and $v$, it reads

$$s^{\mathrm{PC}}_{uv} = \frac{\sum_{\alpha \in O_{uv}} (r_{u\alpha} - \bar{r}_u)(r_{v\alpha} - \bar{r}_v)}{\sqrt{\sum_{\alpha \in O_{uv}} (r_{u\alpha} - \bar{r}_u)^2}\,\sqrt{\sum_{\alpha \in O_{uv}} (r_{v\alpha} - \bar{r}_v)^2}}, \tag{34}$$

where $O_{uv} = \Gamma_u \cap \Gamma_v$ indicates the set of objects rated by both $u$ and $v$, and $\bar{r}_u$ is the mean
rating of user $u$. A constrained Pearson coefficient proposed by Shardanand and Maes [125]
consists of substituting the user mean in Eq. 34 with a "central" rating (for example, in the
scale from 1 to 5, one can set the central rating to be 3). The idea is to take into account the
difference between positive (above the central rating) and negative ratings (below the central
rating). A weighted Pearson coefficient is based on the idea of capturing the confidence that
can be placed on similarity values (when two users evaluated only a few objects in common,
their potentially high similarity should not be trusted as much as for a pair of users with many
overlapping objects). It was proposed [136] to weight the Pearson coefficient as

$$s^{\mathrm{WPC}}_{uv} = \begin{cases} s^{\mathrm{PC}}_{uv}\,\dfrac{|O_{uv}|}{H} & \text{for } |O_{uv}| \le H, \\ s^{\mathrm{PC}}_{uv} & \text{otherwise,} \end{cases} \tag{35}$$

where $H$ is a threshold, determined experimentally, beyond which the correlation measure
can be trusted.

Analogously, the Pearson similarity between objects $\alpha$ and $\beta$ reads

$$s^{\mathrm{PC}}_{\alpha\beta} = \frac{\sum_{u \in U_{\alpha\beta}} (r_{u\alpha} - \bar{r}_\alpha)(r_{u\beta} - \bar{r}_\beta)}{\sqrt{\sum_{u \in U_{\alpha\beta}} (r_{u\alpha} - \bar{r}_\alpha)^2}\,\sqrt{\sum_{u \in U_{\alpha\beta}} (r_{u\beta} - \bar{r}_\beta)^2}}, \tag{36}$$

where $U_{\alpha\beta}$ is the set of users who rated both $\alpha$ and $\beta$, and $\bar{r}_\alpha$ is the average rating of object
$\alpha$. Experiments have shown that the Pearson coefficient performs better than the cosine
vector index [107]. When only binary ratings are available (like or dislike, purchase or no
purchase, click or no click, etc.), the cosine and Pearson coefficient can still be applied to
quantify the similarity of vectors with binary elements. For example, Amazon's patented
algorithm [115] computes the cosine similarity between binary vectors representing users'
purchases and uses it in item-based collaborative filtering.
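The rating-based similarity measures of this section can be sketched in a few lines of Python. The function names and the representation of a user as a dictionary from object to rating are hypothetical. The sketch also assumes that the user mean in the Pearson coefficient is taken over all objects rated by that user, and that unrated objects contribute zero to the cosine index.

```python
from math import sqrt


def cosine_similarity(r_x, r_y):
    """Cosine index between two sparse rating vectors (missing = 0)."""
    common = set(r_x) & set(r_y)
    dot = sum(r_x[a] * r_y[a] for a in common)
    norm_x = sqrt(sum(r * r for r in r_x.values()))
    norm_y = sqrt(sum(r * r for r in r_y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def pearson_similarity(r_u, r_v):
    """Pearson coefficient over the objects rated by both users."""
    common = set(r_u) & set(r_v)
    if not common:
        return 0.0
    # Assumption: user means are computed over all of the user's ratings.
    mean_u = sum(r_u.values()) / len(r_u)
    mean_v = sum(r_v.values()) / len(r_v)
    num = sum((r_u[a] - mean_u) * (r_v[a] - mean_v) for a in common)
    den_u = sqrt(sum((r_u[a] - mean_u) ** 2 for a in common))
    den_v = sqrt(sum((r_v[a] - mean_v) ** 2 for a in common))
    return num / (den_u * den_v) if den_u and den_v else 0.0


def weighted_pearson(r_u, r_v, threshold):
    """Down-weight the Pearson coefficient when the overlap is small."""
    overlap = len(set(r_u) & set(r_v))
    s = pearson_similarity(r_u, r_v)
    return s * overlap / threshold if overlap <= threshold else s


u = {"a": 1.0, "b": 2.0, "c": 3.0}
v = {"a": 2.0, "b": 4.0, "c": 6.0}
w = {"a": 3.0, "b": 2.0, "c": 1.0}
print(pearson_similarity(u, v), pearson_similarity(u, w))
```

Users `u` and `v` rate in perfect agreement up to scale, so their Pearson coefficient is close to 1, while `u` and `w` have opposite tastes and obtain a value close to -1.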

4.2.2. Structural similarity

As we have mentioned above, similarity can be defined using external attributes
such as tag and content information. However, the required data is usually very difficult
to collect. Another simple and effective way to quantify similarity, structural similarity
[137], is based solely on the network structure of the data. Recent research shows
that structural similarity can produce better recommendations than the Pearson
correlation coefficient, especially when the input data is very sparse [120].

To calculate the structural similarity between users or objects, we generally project the 
user-object bipartite network which contains the complete information about the system 
into a monopartite user-user or object-object network (for more information on this aspect 
of similarity see [10T ]). In the simplest case, two users are considered similar if they have 
voted for at least one common object (analogously, two objects are considered similar if they
have been co-voted by at least one user). More refined similarity metrics can be
roughly categorized as node-dependent vs. path-dependent, local vs. global, parameter-free
vs. parameter-dependent, and so on; here we review some of them.

Table 4: Mathematical definitions of the described node-dependent similarity indices. $\Gamma_x$ denotes the set
of neighbors of node $x$ (which can be either a user or an object node) and $k_x$ is the degree of node $x$.

Index       Definition
CN          $s_{xy} = |\Gamma_x \cap \Gamma_y|$
Salton      $s_{xy} = |\Gamma_x \cap \Gamma_y| / \sqrt{k_x k_y}$
Jaccard     $s_{xy} = |\Gamma_x \cap \Gamma_y| / |\Gamma_x \cup \Gamma_y|$
Sørensen    $s_{xy} = 2|\Gamma_x \cap \Gamma_y| / (k_x + k_y)$
HPI         $s_{xy} = |\Gamma_x \cap \Gamma_y| / \min\{k_x, k_y\}$
HDI         $s_{xy} = |\Gamma_x \cap \Gamma_y| / \max\{k_x, k_y\}$
LHN1        $s_{xy} = |\Gamma_x \cap \Gamma_y| / (k_x k_y)$
AA          $s_{xy} = \sum_{z \in \Gamma_x \cap \Gamma_y} 1 / \ln k_z$
RA          $s_{xy} = \sum_{z \in \Gamma_x \cap \Gamma_y} 1 / k_z$
PA          $s_{xy} = k_x k_y$

(i) Node-dependent similarity. The simplest weighted similarity index is Common
Neighbors (CN) where the similarity of two nodes is directly given by the number of
common neighbors (think of the number of users who bought both objects $\alpha$ and $\beta$ or the
number of objects shared by users $u$ and $v$). By considering degrees of the two target nodes,
six variations of CN were derived: Salton Index [138], Jaccard Index [139], Sørensen Index
[140], Hub Promoted Index (HPI) [141], Hub Depressed Index (HDI) and Leicht-Holme-Newman
Index (LHN1)⁶ [142]. One can further take into account degrees of respective
common neighbors to reward less-connected neighbors with a higher weight as in the
Adamic-Adar Index (AA) [143] and the Resource Allocation Index (RA) [100]. Note that since AA uses
a logarithmic weighting, it penalizes high-degree common neighbors less than RA. Finally, the
Preferential Attachment Index (PA) builds on the classical preferential attachment rule in
network science [144]. This index has been used to quantify the functional significance of
links subject to various network-based dynamics, such as percolation [145], synchronization
[146] and transportation [147]. Note that these similarity indices can also be computed for a
bipartite network where common neighbors are objects and users when considering user and
object similarity, respectively. A summary of mathematical definitions of these similarity
indices is shown in Table 4.
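The node-dependent indices can all be computed from neighbor sets and degrees. The sketch below assumes the standard definitions of these indices from the link-prediction literature. The function name `node_similarities`, the dictionary-of-sets graph representation and the toy bipartite network are hypothetical, and both nodes are assumed to have non-zero degree.

```python
from math import log, sqrt


def node_similarities(neighbors, x, y):
    """Node-dependent similarity indices between nodes x and y.

    neighbors: dict mapping each node to the set of its neighbors.
    """
    gx, gy = neighbors[x], neighbors[y]
    common = gx & gy
    kx, ky = len(gx), len(gy)
    cn = len(common)
    return {
        "CN": cn,
        "Salton": cn / sqrt(kx * ky),
        "Jaccard": cn / len(gx | gy),
        "Sorensen": 2 * cn / (kx + ky),
        "HPI": cn / min(kx, ky),
        "HDI": cn / max(kx, ky),
        "LHN1": cn / (kx * ky),
        # A common neighbor of two distinct nodes has degree >= 2,
        # so the logarithm below is never zero.
        "AA": sum(1 / log(len(neighbors[z])) for z in common),
        "RA": sum(1 / len(neighbors[z]) for z in common),
        "PA": kx * ky,
    }


# Toy bipartite network: users u1..u3 and objects a, b, c.
graph = {
    "u1": {"a", "b", "c"}, "u2": {"b", "c"}, "u3": {"c"},
    "a": {"u1"}, "b": {"u1", "u2"}, "c": {"u1", "u2", "u3"},
}
print(node_similarities(graph, "u1", "u2"))
```

For users `u1` and `u2` the common neighbors are objects `b` and `c`, so CN equals 2, and RA equals 1/2 + 1/3 because the more popular object `c` contributes less.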

(ii) Path-dependent similarity. The basic assumption here is that two nodes are similar
if they are connected by many paths. Since the elements of the $n$-th power of the adjacency
matrix, $A^n$, are equal to the number of distinct paths of length $n$ between the respective pairs
of nodes, path-dependent similarity metrics can usually be written in a compact form such as

$$s^{\mathrm{LP}}_{xy} = (A^2)_{xy} + \epsilon (A^3)_{xy} \tag{37}$$

for the Local Path Index [148], where only paths of length two and three count and $\epsilon$ is a
damping parameter. (Note that in a bipartite network, only paths of an even length can
exist between nodes of the same kind.) By including paths of all lengths, we obtain the
classical Katz similarity [149], which is defined as

$$s^{\mathrm{Katz}}_{xy} = \beta A_{xy} + \beta^2 (A^2)_{xy} + \beta^3 (A^3)_{xy} + \dots, \tag{38}$$

where $\beta$ is a damping factor controlling the path weights. This can be written as
$s^{\mathrm{Katz}} = (I - \beta A)^{-1} - I$. A variant of the Katz index, the Leicht-Holme-Newman Index (LHN2) [142],
was proposed where the term $(A^l)_{xy}$ is replaced with $(A^l)_{xy}/E[(A^l)_{xy}]$, where $E[X]$ is the
expected value of $X$.

⁶ We use the abbreviation LHN1 to distinguish this index from another index, named LHN2, also
proposed by Leicht, Holme and Newman.
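The path-dependent indices can be illustrated with plain matrix powers. The following sketch computes the Local Path Index and a truncated version of the Katz series. Truncating the series at a maximum path length is an assumption made here to avoid a matrix inversion (the closed form is only valid when $\beta$ is smaller than the reciprocal of the largest eigenvalue of $A$), and all function names are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def local_path_index(adj, eps):
    """Local Path Index: A^2 + eps * A^3."""
    a2 = matmul(adj, adj)
    a3 = matmul(a2, adj)
    n = len(adj)
    return [[a2[i][j] + eps * a3[i][j] for j in range(n)] for i in range(n)]


def katz_truncated(adj, beta, max_len):
    """Katz series beta*A + beta^2*A^2 + ... truncated at A^max_len."""
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    weight = 1.0
    for _ in range(max_len):
        power = matmul(power, adj)
        weight *= beta
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
    return total


# Path graph 0 - 1 - 2.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
lp = local_path_index(adj, eps=0.5)
print(lp[0][2], lp[0][1])  # 1.0 1.0
```

Nodes 0 and 2 are joined by one path of length two and none of length three, whereas nodes 0 and 1 are joined by two paths of length three (0-1-0-1 and 0-1-2-1), which the damping parameter scales down to 1.0.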

(iii) Random-walk-based similarity. Another group of methods is based on random
walks on networks.

Average Commute Time: The average commute time between nodes $x$ and $y$ is
defined as the average number of steps required by a random walker starting from node $x$
to reach node $y$ plus that from $y$ to $x$. It can be obtained in terms of the pseudoinverse of
the network's Laplacian matrix, $L^+$, as [150, 151]

$$n(x, y) = \left( (L^+)_{xx} + (L^+)_{yy} - 2(L^+)_{xy} \right) E, \tag{39}$$

where $E$ is the number of edges in the network. Assuming that two nodes are similar if
they have a small average commute time, similarity between nodes $x$ and $y$ can be defined
as the reciprocal of their average commute time

$$s^{\mathrm{ACT}}_{xy} = \frac{1}{(L^+)_{xx} + (L^+)_{yy} - 2(L^+)_{xy}}, \tag{40}$$

where the constant factor $E$ has been removed.

Cosine-based on $L^+$: This index is an inner-product-based measure. In the Euclidean
space spanned by $\mathbf{v}_x = \Lambda^{1/2} U^T \mathbf{e}_x$, where $U$ is an orthonormal matrix composed of the
eigenvectors of $L^+$ ordered in a decreasing order of their eigenvalues $\lambda_x$, $\Lambda = \mathrm{diag}(\lambda_x)$, $\mathbf{e}_x$ is a
column basis vector ($(\mathbf{e}_x)_y = \delta_{xy}$) and $T$ is matrix transposition, elements of the
pseudoinverse of the Laplacian matrix are the inner products of the node vectors, $(L^+)_{xy} = \mathbf{v}_x^T \mathbf{v}_y$.
Consequently, cosine similarity is defined as [151]

$$s^{\cos+}_{xy} = \frac{\mathbf{v}_x^T \mathbf{v}_y}{|\mathbf{v}_x|\,|\mathbf{v}_y|} = \frac{(L^+)_{xy}}{\sqrt{(L^+)_{xx} (L^+)_{yy}}}. \tag{41}$$

Random Walk with Restart: This index is a direct application of the PageRank
algorithm [152]. Consider a random walker starting from node $x$ who recursively moves to a
random neighbor with probability $c$ and returns to node $x$ with probability $1 - c$. Denoting
by $q_{xy}$ the resulting stationary probability that the walker is located at node $y$, we can
write

$$\mathbf{q}_x = c P^T \mathbf{q}_x + (1 - c)\,\mathbf{e}_x \tag{42}$$

where $P$ is the transition matrix with elements $P_{xy} = 1/k_x$ if $x$ and $y$ are connected and
$P_{xy} = 0$ otherwise. The solution to this equation is

$$\mathbf{q}_x = (1 - c)(I - cP^T)^{-1} \mathbf{e}_x. \tag{43}$$

Finally, the similarity index is defined as

$$s^{\mathrm{RWR}}_{xy} = q_{xy} + q_{yx}. \tag{44}$$

A fast algorithm to calculate this index was proposed in [153], and the application to
recommender systems was studied in [120], where it was found that this similarity performs
better than the Pearson correlation coefficient.
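The stationary vector of the random walk with restart can be approximated without a matrix inversion by iterating the update $\mathbf{q}_x \leftarrow cP^T\mathbf{q}_x + (1-c)\mathbf{e}_x$ until it stops changing. The sketch below does this for an unweighted graph. The function names, the adjacency-list representation and the fixed iteration count are assumptions of this illustration (it is not the fast algorithm of [153]), and every node is assumed to have at least one neighbor.

```python
def rwr_vector(neighbors, x, c, iterations=200):
    """Approximate the stationary distribution of a walker that restarts
    at node x with probability 1 - c at every step."""
    q = {node: 0.0 for node in neighbors}
    q[x] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in neighbors}
        for node, prob in q.items():
            # With probability c the walker moves to a random neighbor.
            share = c * prob / len(neighbors[node])
            for m in neighbors[node]:
                nxt[m] += share
        # With probability 1 - c the walker returns to the start node.
        nxt[x] += 1.0 - c
        q = nxt
    return q


def rwr_similarity(neighbors, x, y, c=0.8):
    """Symmetric RWR similarity q_xy + q_yx."""
    return rwr_vector(neighbors, x, c)[y] + rwr_vector(neighbors, y, c)[x]


# Two connected nodes: the exact solution is q_ab = c / (1 + c).
pair = {"a": ["b"], "b": ["a"]}
print(rwr_similarity(pair, "a", "b", c=0.5))
```

Since the error shrinks by a factor of $c$ in each iteration, 200 iterations are far more than needed for the values of $c$ used here. For the two-node example with $c = 0.5$ the printed value is close to 2/3.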

SimRank: This index is defined based on the assumption that two nodes are similar if
they are connected to similar nodes. This allows us to define SimRank in a self-consistent
way [154] as

$$s^{\mathrm{SimRank}}_{xy} = C\,\frac{\sum_{z \in \Gamma_x} \sum_{z' \in \Gamma_y} s^{\mathrm{SimRank}}_{zz'}}{k_x k_y}, \tag{45}$$

where $s_{xx} = 1$ and $C \in [0, 1]$ is a free parameter. SimRank can also be interpreted
in terms of a random-walk process: $s^{\mathrm{SimRank}}_{xy}$ measures how soon two random walkers, who
respectively start at nodes $x$ and $y$, are expected to meet at a certain node.

Matrix Forest Index: This index introduces similarity between $x$ and $y$ as the ratio
of the number of spanning rooted forests such that nodes $x$ and $y$ belong to the same
tree rooted at $x$ to all spanning rooted forests of the network (for details see [155]). Its
mathematical definition

$$s^{\mathrm{MFI}} = (I + L)^{-1} \tag{46}$$

can be further parametrized to obtain a variant of MFI

$$s^{\mathrm{PMFI}} = (I + \alpha L)^{-1}, \quad \alpha > 0. \tag{47}$$

According to the authors, $\alpha > 0$ determines the relative weight given to long connections
between vertices of the graph versus short ones.

Local Random Walk: To measure similarity between nodes $x$ and $y$, a random walker
is introduced in node $x$ and thus the initial occupancy vector is $\boldsymbol{\pi}_x(0) = \mathbf{e}_x$. This vector
evolves as $\boldsymbol{\pi}_x(t + 1) = P^T \boldsymbol{\pi}_x(t)$ for $t \ge 0$. The LRW index at time step $t$ is defined [156] as

$$s^{\mathrm{LRW}}_{xy}(t) = q_x \pi_{xy}(t) + q_y \pi_{yx}(t) \tag{48}$$

where $q$ is the initial configuration function and $t$ denotes the time step. In [156] it was
suggested to use a simple approach where $q$ is determined by node degree: $q_x = k_x/M$. Note
that in bipartite networks, an even time step must be used to obtain similarity between
nodes of the same kind.

Superposed Random Walk: Similar to the RWR index, another index was proposed in [156]
where the random walker is continuously released at the starting point, resulting
in a higher similarity between the target node and its nearby nodes. The mathematical
expression reads

$$s^{\mathrm{SRW}}_{xy}(t) = \sum_{\tau=1}^{t} s^{\mathrm{LRW}}_{xy}(\tau) = \sum_{\tau=1}^{t} \left[ q_x \pi_{xy}(\tau) + q_y \pi_{yx}(\tau) \right]. \tag{49}$$
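The local and superposed random walk indices can be obtained together by propagating the occupancy vectors step by step. In the sketch below the function names are hypothetical, and the normalization of the initial configuration function $q$ is assumed to be the number of edges unless another constant is passed in. Since the normalization is the same for all node pairs, it does not affect the ranking of similarities.

```python
def walk_step(neighbors, pi):
    """One step pi(t+1) = P^T pi(t) of an unbiased random walk."""
    nxt = {node: 0.0 for node in neighbors}
    for node, prob in pi.items():
        if prob > 0.0:
            share = prob / len(neighbors[node])
            for m in neighbors[node]:
                nxt[m] += share
    return nxt


def lrw_and_srw(neighbors, x, y, steps, norm=None):
    """Return (LRW at time `steps`, SRW summed up to time `steps`)."""
    if norm is None:
        # Assumed normalization: the number of edges in the network.
        norm = sum(len(v) for v in neighbors.values()) / 2
    q_x = len(neighbors[x]) / norm
    q_y = len(neighbors[y]) / norm
    pi_x = {node: 1.0 if node == x else 0.0 for node in neighbors}
    pi_y = {node: 1.0 if node == y else 0.0 for node in neighbors}
    lrw = 0.0
    srw = 0.0
    for _ in range(steps):
        pi_x = walk_step(neighbors, pi_x)
        pi_y = walk_step(neighbors, pi_y)
        lrw = q_x * pi_x[y] + q_y * pi_y[x]
        srw += lrw
    return lrw, srw


# Two users a and b who both collected the single object o.
net = {"a": ["o"], "b": ["o"], "o": ["a", "b"]}
print(lrw_and_srw(net, "a", "b", steps=2))  # (0.5, 0.5)
```

After one step both walkers sit on the object node, so the user-user similarity is still zero, which illustrates why an even number of steps is needed in a bipartite network.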

In [151], several random-walk-based similarity indices, such as ACT, cos+ and MFI,
were applied in collaborative filtering. Their experimental results show that in general,
Laplacian-based similarities perform well.

4.2.3. Similarity involving external information

Besides the fundamental user-object relations and the ratings, additional information 
can be exploited to define or improve the node similarity. 

(i) Attributes. The dimension and elements of the attribute vectors are defined in
advance by domain experts, and are identical for all users (objects) in the system. The
similarity between two users (objects) is obtained by calculating the correlation of their
corresponding attribute vectors. For example, user profiles, usually including age, sex,
nationality, location, career, etc., can be simply applied to quantify the similarity between
users based on the assumption that two users are similar when they have many common
features. In [157], a hybrid method considering both attributes of objects and the ratings
was shown to provide better recommendations than when these two sources of information
are used independently. However, the application of attributes poses risks to user
privacy: as shown by recent work on the de-anonymization of large datasets [25], the collection
and utilization of attribute data raise several sensitive issues. For more information on
the issues of user privacy see [158].

(ii) Contents. Modern information retrieval techniques allow us to automatically extract
content and meta information of the available objects. Object similarity can hence be
calculated based on the content comparison of the given objects. This is usually referred
to as content-based recommendation in the literature [159]. Unlike collaborative filtering, in
a content-based algorithm, recommendations are made based solely on the profile built
up by analyzing the content of objects that the target user has rated in the past. The
recommendation problem hence becomes a search for objects whose content is most similar
to the content of objects already preferred by the target user. The classical method to
weigh content is TF-IDF (term frequency - inverse document frequency) [138], which is a
weighting metric often used in information retrieval and text mining. A term $t$ in a given
document $d$ is weighted as

$$w_{t,d} = \mathrm{tf}(t, d) \times \log \frac{|D|}{|\{d' \in D : t \in d'\}|}, \tag{50}$$

where $\mathrm{tf}(t, d)$ is the frequency of $t$ in document $d$, $|D|$ is the number of all observed documents,
and the denominator is the number of documents that contain $t$. The weights $w_{t,d}$ can then be
used to measure the similarity of objects as defined in Sec. 4.2.2. In addition,
if two users have collected objects with similar content, we may assume that these two users
are similar.
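A minimal sketch of the TF-IDF weighting is given below. It assumes that each document is already tokenized into a list of terms, that the term frequency is the raw count of the term in the document, and that the natural logarithm is used. The function name `tf_idf` and the toy documents are hypothetical.

```python
from collections import Counter
from math import log


def tf_idf(documents):
    """Return one dict of term weights per document.

    documents: list of token lists, one list per document.
    """
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * log(n_docs / doc_freq[t]) for t in tf})
    return weights


docs = [["space", "war", "war"], ["space", "love"]]
w = tf_idf(docs)
print(w[0])
```

The term "space" occurs in every document and therefore receives zero weight, while "war" is both frequent in the first document and rare in the collection, so it dominates that document's weight vector. The resulting sparse vectors can be compared with the cosine index.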

Both content-based methods and collaborative filtering have their individual limitations:
CF systems do not explicitly incorporate feature information and face
the sparsity and cold-start problems, while content-based systems do not necessarily
incorporate the information in preference similarity across individuals (see summaries and
discussions in Refs. [15, 21]). Many hybrid algorithms have therefore been proposed to avoid
certain weaknesses of each approach and thereby improve the recommendation performance. The
combination methods can be classified into four categories: (i) implement separate
collaborative and content-based methods and then combine their predictions [160, 161]; (ii) add
content-based characteristics to collaborative models [162, 163, 164]; (iii) add collaborative
characteristics to content-based models [165]; and (iv) develop a general unified model
that integrates both content-based and collaborative characteristics [166, 167, 31, 168, 169].
However, these methods are only effective if the objects contain rich content information 
that can be automatically extracted. This is the case for recommendation of books, articles 
and bookmarks, but not for videos, music tracks or pictures.

(iii) Tags. Collaborative tagging systems emerged with the advent of Web 2.0 [92].
Unlike traditional taxonomies with a hierarchical structure, tagging systems allow
users to freely assign keywords (which are usually referred to as tags) to manage their
own collections without the limitation of a preset vocabulary. Tags provide a rich source
of information for recommendation purposes. With the tagging information, algorithms
can be easily designed to calculate user similarity and object similarity by considering
tag vectors in user and object space, respectively. To alleviate the effects of spam and
magnify personalized user preferences, weighting techniques are often applied to measure
the importance of each element in a given tag vector.

5. Dimensionality Reduction Techniques

Dimensionality reduction aims at downsizing the amount of relevant data while
preserving the major information content. It is often applied in areas such as data mining,
machine learning and cluster analysis. Most techniques of dimensionality reduction
involve feature extraction which makes use of hidden variables, or so-called latent variables,
to describe the underlying causes of co-occurrence data. In the context of movie selection,
potential viewers may consider genres such as action, romance or comedy features in a
movie, which constitute the latent variables. These latent variables are usually
represented by multi-dimensional vectors. A simplified picture of two-dimensional vectors of
action and romance is shown in Fig. 7, which shows that user Peter has a preference for
action movies, while user Mary prefers romantic content. Given these vectors and the
corresponding vectors of movies, we can define the expected rating of a user on a movie as
the scalar product of their vectors. For instance, we expect Peter to prefer movie $\beta$ rather
than $\alpha$, while the opposite is true for Mary. Recommendations can thus be made once the
vectors are computed. If $K$ hidden variables are used, the latent vectors are $K$-dimensional
and dimensionality reduction is achieved if $K(N + M) < NM$, since the number of relevant
variables is reduced. In practice, these techniques are particularly suitable for large data
sets which are costly to store and manipulate.

Instead of introducing latent variables which describe interests and genres, users and ob- 
jects can also be assigned to individual classes which leads to reduction of data dimension. 
In this case, the co-occurrence of a user-object pair is explained by the relation between the
classes to which the user and the object belong. Though the original intention for such
classification is not to reduce the data dimension, the number of classes used is usually 
significantly smaller than the number of users and objects, which then results in reduction 
of dimensionality. 

Dimensionality reduction is particularly well suited to collaborative filtering (it is
sometimes referred to as model-based collaborative filtering), as for most applications only
a small fraction of user-object pairs are observed such that the number of relevant
variables can be significantly reduced. Reductions in dimensionality effectively preserve the
information content while drastically decreasing the computational complexity and memory
requirements for making recommendations. In this section, several techniques of
dimensionality reduction with applications to recommender systems are discussed, including
singular value decomposition (SVD) [170], Bayesian clustering [171], probabilistic latent
semantic analysis (pLSA) [172] and latent Dirichlet allocation (LDA) [173].

5.1. Singular Value Decomposition (SVD) 

We start with an $N \times M$ matrix $R$ whose element $r_{i\alpha}$ corresponds to the rating of user $i$
to object $\alpha$ (if the rating has not yet been given, the corresponding element of $R$ is zero).
In the case without numeric ratings, $R$ becomes the adjacency matrix with $r_{i\alpha} = 1, 0$ for
connected and unconnected user-object pairs, respectively. The recommendation process then
aims to determine which presently zero entries of $R$ have a high chance to be non-zero in the
future. Note that $R$ is a sparse matrix for most applications because only a small fraction
of all its elements are different from zero.

Figure 7: An example of a user's movie selection process with hidden variables.

Dimensionality reduction is achieved by introducing $K$ hidden variables which
categorize tastes of users and attributes of objects. The original $R$ is approximated as the
product of two matrices

$$R \approx WV \tag{51}$$

where $W$ and $V$ are matrices with dimension $N \times K$ and $K \times M$, respectively. They contain
the taste information for users and content information for objects, respectively, expressing
them in terms of $K$ hidden variables. Because of these hidden variables, SVD belongs to a
broad class of latent semantic analysis (LSA) techniques. From the product of $W$ and $V$
in Eq. (51), we see that objects are selected by users based on the overlap between a user's
tastes and an object's attributes. When the number of hidden variables is smaller than $N$
and $M$, the number of parameters needed to describe the system reduces from $NM$ for
the original $R$ to $NK + KM$ for the product $WV$. This approach is also known as matrix
factorization (MF) as $R$ is factorized into a product of matrices.

To obtain $W$ and $V$, singular value decomposition (SVD) is a common algebraic tool
in LSA which results in downsizing of relevant variables and, at the same time, finding a
good approximation of $R$. In SVD, $R$ is factorized as

$$R = W \Sigma V \tag{52}$$

where $\Sigma$ is a $K \times K$ diagonal matrix, and equality in the above factorization holds with
$K = \min(N, M)$. The matrix $\Sigma$ contains the so-called singular values of $R$, which are
the square roots of the eigenvalues of $RR^T$ (or $R^TR$). To benefit from dimensionality
reduction, we put $K < \min(N, M)$, which corresponds to the rank-$K$ approximation in
SVD, and include only the $K$ largest singular values in $\Sigma$ and replace the others by zero.



Equality in Eq. (52) no longer holds and $R$ is approximated with $\tilde{R}$ given by

$$R \approx \tilde{R} = W \tilde{\Sigma} V \tag{53}$$

where $\tilde{\Sigma}$ is the rank-$K$ approximation of $\Sigma$.

It can be shown that $\tilde{R}$ can be found by minimizing the Frobenius norm of the matrix
$R - \tilde{R}$, that is

$$\tilde{R} = \arg\min_{R'} \|R - R'\| \tag{54}$$

with the rank of $R'$ restricted to be $K$. The Frobenius norm is given by

$$\|R - \tilde{R}\| = \sqrt{\sum_{i\alpha} (r_{i\alpha} - \tilde{r}_{i\alpha})^2}, \tag{55}$$

which corresponds to the square root of the total squared error of $\tilde{R}$ with respect to $R$, with
$\tilde{r}_{i\alpha}$ denoting the element $(i, \alpha)$ in $\tilde{R}$. SVD thus provides a simple cost function for measuring
the agreement between $R$ and $\tilde{R}$. To obtain $\tilde{R}$ explicitly for a particular $R$, a simple iterative
approach [170] based on gradient descent can be employed. $\tilde{\Sigma}$ in Eq. (53) is first absorbed into
either $W$ or $V$ to obtain $\tilde{R} = WV$ as in Eq. (51). Our task is to obtain the optimal $W$ and $V$
for which $\tilde{R} = WV$ minimizes the norm in Eq. (55). We now express $\tilde{r}_{i\alpha}$ as

$$\tilde{r}_{i\alpha} = \sum_{k=1}^{K} w_{ik} v_{k\alpha}. \tag{56}$$

After substitution of the above expression, we minimize the difference between $r_{i\alpha}$ and $\tilde{r}_{i\alpha}$
by differentiation of $(r_{i\alpha} - \tilde{r}_{i\alpha})^2$, namely

$$\frac{\partial}{\partial w_{ik}} (r_{i\alpha} - \tilde{r}_{i\alpha})^2 = -2 v_{k\alpha} \Big( r_{i\alpha} - \sum_{l=1}^{K} w_{il} v_{l\alpha} \Big), \tag{57}$$

$$\frac{\partial}{\partial v_{k\alpha}} (r_{i\alpha} - \tilde{r}_{i\alpha})^2 = -2 w_{ik} \Big( r_{i\alpha} - \sum_{l=1}^{K} w_{il} v_{l\alpha} \Big). \tag{58}$$

Since the norm is non-negative by definition, we minimize the norm square which leads
to the same result while encountering simpler expressions in the process. The obtained
gradients can be used to write a gradient descent-based updating procedure for $w_{ik}$ and
$v_{k\alpha}$ in the form

$$w_{ik}(t + 1) = w_{ik}(t) + 2\eta\, v_{k\alpha}(t)\, e_{i\alpha}(t), \tag{59}$$
$$v_{k\alpha}(t + 1) = v_{k\alpha}(t) + 2\eta\, w_{ik}(t)\, e_{i\alpha}(t), \tag{60}$$

where $t$ denotes the iteration step and

$$e_{i\alpha}(t) = r_{i\alpha} - \sum_{k=1}^{K} w_{ik}(t)\, v_{k\alpha}(t). \tag{61}$$

The learning rate $\eta > 0$ should be small to avoid big jumps in the solution space. With
random initial conditions on $w_{ik}$ and $v_{k\alpha}$, these equations are iterated until the squared
norm shows no further decrease. Other procedures such as the variants of stochastic
gradient descent may be applied to improve computational efficiency [174]. As suggested
by [170], subtracting a small term from the gradient can prevent large resulting weights
$w_{ik}$ and $v_{k\alpha}$ (this is equivalent to minimizing $\|R - \tilde{R}\|^2 + \frac{\lambda}{2}\|W\|^2 + \frac{\lambda}{2}\|V\|^2$, similar to
using the Tikhonov regularization which gives preference to solutions with small norms in
ill-posed problems), which leads to the following update rule

$$w_{ik}(t + 1) = w_{ik}(t) + \eta \left( 2 v_{k\alpha}(t)\, e_{i\alpha}(t) - \lambda w_{ik}(t) \right), \tag{62}$$
$$v_{k\alpha}(t + 1) = v_{k\alpha}(t) + \eta \left( 2 w_{ik}(t)\, e_{i\alpha}(t) - \lambda v_{k\alpha}(t) \right), \tag{63}$$

where a new parameter $\lambda \ge 0$ is introduced to this end; $\lambda > 0$ usually leads to better
accuracy of the results. Using the resulting $w_{ik}$ and $v_{k\alpha}$, we can compute $\tilde{R} = WV$, which
has non-zero values on entries where the input matrix $R$ is zero (representing unexpressed
evaluations); element $\tilde{r}_{i\alpha}$ then predicts the possible rating given by user $i$ to object $\alpha$.
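The regularized gradient-descent procedure can be sketched in a few lines of Python. The sketch uses the fact that the derivative of the squared error $e_{i\alpha}^2$ with respect to $w_{ik}$ is $-2 v_{k\alpha} e_{i\alpha}$ (and symmetrically for $v_{k\alpha}$), so each observed rating moves the user factors along the object factors and vice versa. The function names, the default parameter values, the positive random initialization and the toy rating data are assumptions of this illustration, not values taken from the cited works.

```python
import random


def squared_error(ratings, w, v):
    """Total squared error over the observed ratings only."""
    return sum((r - sum(w[i][k] * v[k][a] for k in range(len(v)))) ** 2
               for (i, a), r in ratings.items())


def factorize(ratings, n_users, n_items, k=2, eta=0.01, lam=0.02,
              epochs=500, seed=0):
    """Learn W (n_users x k) and V (k x n_items) from observed ratings.

    ratings: dict mapping (user index, item index) -> rating.
    """
    rng = random.Random(seed)
    w = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
    v = [[rng.uniform(0.1, 1.0) for _ in range(n_items)] for _ in range(k)]
    for _ in range(epochs):
        for (i, a), r in ratings.items():
            # Prediction error for this entry before the update.
            e = r - sum(w[i][f] * v[f][a] for f in range(k))
            for f in range(k):
                w_old = w[i][f]
                w[i][f] += eta * (2 * v[f][a] * e - lam * w_old)
                v[f][a] += eta * (2 * w_old * e - lam * v[f][a])
    return w, v


observed = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 2, (2, 2): 5}
w, v = factorize(observed, n_users=3, n_items=3)
# Predicted rating of user 0 for the unrated item 2.
print(sum(w[0][f] * v[f][2] for f in range(2)))
```

Only observed entries enter the loop, so the zeros that stand for missing ratings never pull the factors towards zero. The printed value depends on the random initialization and should be read as an illustration only.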

Note that while $\|R - \tilde{R}\|$ measures the "error" of $\tilde{R}$ with respect to $R$, it is usually
very different from (typically lower than) the error of the ultimate rating estimates. The reason
for this, of course, lies in the fact that $\|R - \tilde{R}\|$ is minimized while knowing $R$, and the
$K(N + M)$ free parameters usually allow us to achieve a very low value of $\|R - \tilde{R}\|$. To
measure performance of this method correctly, the available data must be divided into a
training set (which is used to "learn" $W$ and $V$) and a test set (see Sec. 3.3 for details).
Increasing $K$ does not automatically improve the results: over-fitting the data (providing
too many free parameters) can lead to inferior accuracy. The use of this and other
machine-learning methods hence requires a certain amount of tests and/or experience.

In addition to a simple iteration procedure, SVD also enjoys flexibility in dealing
with additional data. For instance, one can easily include the influence of individual rating
bias in the framework. Suppose user $i$ tends to give scores that are on average higher by $b_i$
than those of other users, while object $\alpha$ tends to receive scores that are higher by $b_\alpha$
than those of other items; the predicted scores can then be expressed as [175]

$$\tilde{r}_{i\alpha} = \mu + b_i + b_\alpha + \sum_{k=1}^{K} w_{ik} v_{k\alpha}, \tag{64}$$

where $\mu$ is the average value among all user-object pairs. In this case, $\|R - \tilde{R}\|$ is minimized
with respect to $b_i$, $b_\alpha$, $w_{ik}$ and $v_{k\alpha}$ for all $i$, $k$ and $\alpha$. Other than individual bias, there
are other variants of SVD which utilize the relations between users in a social network, for
instance the similarity in taste between friends, to improve the recommendation accuracy
of the factorized matrices [176].

Apart from the case with user-object ratings as the only input data, the above-described
procedure can be generalized to incorporate additional information [170]. Suppose $d_{i\alpha}$ is the
additional information (e.g. date) associated with a given user-object pair. Transforming
$d_{i\alpha}$ into positive integers $l$ with $1 \le l \le L$, the elements $y_{kl}$ in the $K \times L$ matrix $Y$ contain
the relation between each of these additional data and the hidden variables. For example,
a large number of movies with romantic content are reviewed on Valentine's Day, leading
to a large value in the corresponding entries of romance and Valentine's Day in $Y$. With the
additional information, the matrix $\tilde{R}$ can be expressed as

$$\tilde{r}_{i\alpha} = \sum_{k=1}^{K} w_{ik}\, v_{k\alpha}\, y_{k d_{i\alpha}}. \tag{65}$$

Similar derivations based on gradient descent provide updating procedures for $w_{ik}$, $v_{k\alpha}$ and
$y_{k d_{i\alpha}}$ which can then be used to obtain $\tilde{R}$ with the smallest error with respect to $R$.

5.2. Bayesian Clustering 

Before describing a probabilistic version of LSA, we first introduce the Bayesian clustering
method, which is also probabilistic but simpler in formulation. In Bayesian networks,
the value of a variable depends only on the values of its parent variables. For example,
the probability of a variable $x$ is described by the conditional probability $P(x \mid \mathrm{pa}_x)$, where
$\mathrm{pa}_x$ is the parent variable of $x$. The joint probability of several variables can
be factorized as $P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}_{x_i})$, which represents the dependency
structure of the variables. However, obtaining the most relevant dependency structure, i.e., the
dependency relation between different nodes in the Bayesian network, is not a trivial task.

For the purpose of personalized recommendation, we describe a two-sided clustering
[171, 177] which is easy to implement. To obtain the rating for an unobserved user-object
pair, one classifies users and objects into $K_{\mathrm{user}}$ and $K_{\mathrm{object}}$ classes, respectively. The values
of $K_{\mathrm{user}}$ and $K_{\mathrm{object}}$ are parameters of the algorithm, similarly to $K$ which was a parameter
of SVD. We assume that there is a simple Bayesian network that underlies the input data;
the simplest assumption is that $r_{i\alpha}$ depends only on the user's class $c_i$ and the object's
class $c_\alpha$. The probability of $r_{i\alpha}$ can then be written as

$$P(r_{i\alpha}) = \sum_{c_i=1}^{K_{\mathrm{user}}} \sum_{c_\alpha=1}^{K_{\mathrm{object}}} P(r_{i\alpha}|c_i, c_\alpha) P(c_i) P(c_\alpha).$$

To obtain an estimate of $r_{i\alpha}$, we need to find $P(c_i)$, $P(c_\alpha)$ and $P(r_{i\alpha}|c_i, c_\alpha)$, which is
effectively $P(r|x, y)$ as the rating $r$ depends only on the user class $x$ and the object
class $y$. This can be done by applying inference methods including marginal
estimation by belief propagation [178, 179] and likelihood maximization by expectation
maximization [180]. Here we describe another simple scheme, known as the Gibbs sampling
method [181], which is similar to the heat bath algorithm in statistical physics [182, 183].
Gibbs sampling is useful when the joint distribution of all variables is difficult to sample (in
our case, $P(\{r|(x, y)\}, \{c_i\}, \{c_\alpha\})$), while sampling the conditional probability of individual
variables given all other variables is comparatively easy (e.g., $P(c_{i'}|\{r|(x, y)\}, \{c_i\}_{-i'}, \{c_\alpha\})$).
It is similar to the heat bath algorithm as it models the state of a system moving in a phase
space and samples the required (physical) quantities at certain time intervals.

Here we describe the Gibbs sampling scheme suggested in [171] to sample the state
$(\{r|(x, y)\}, \{c_i\}, \{c_\alpha\})$ of the system, which allows us to evaluate the predicted ratings for
any unobserved user-object pair. This scheme was developed to sample binary ratings so
we assume that $r_{i\alpha} = 1$ when user $i$ has collected object $\alpha$ or rated it with a score above a
certain threshold; $r_{i\alpha} = 0$ otherwise. In this case, one can represent $P(r|x, y)$ by a single
variable $P_{xy}$ corresponding to the probability that a user from category $x$ likes an
object from category $y$.

We start the algorithm with a random latent class assignment for all users and objects
and evaluate $N_x$, $N_y$ and $N_{xy}$, which respectively correspond to the number of users in class $x$, the
number of objects in class $y$, and the number of observed user-object pairs (i.e., $r_{i\alpha} = 1$) for
each pair of classes. We then draw values of $P_{xy}$ from the beta distribution with parameters
$(N_{xy} + 1, N_x N_y - N_{xy} + 1)$, or simply approximate $P_{xy}$ by its approximate mean $P_{xy} = N_{xy}/(N_x N_y)$.
Similarly, the variables $P_x$ and $P_y$, respectively defined as the probability that a random
user or object is classified in class $x$ or $y$, are drawn from Dirichlet distributions or simply
approximated by $P_x = N_x/N$ and $P_y = N_y/M$. All these values of $P_{xy}$, $P_x$ and $P_y$ are used
to evaluate the transition probability of the system from the present state to another state
in the phase space. We then successively pick either a random user or a random object
and update its latent class as [171]

$$P(c_i = x) \propto P_x \prod_{y=1}^{K_{\mathrm{object}}} P_{xy}^{\sum_\alpha r_{i\alpha} \delta_{c_\alpha, y}} (1 - P_{xy})^{\sum_\alpha (1 - r_{i\alpha}) \delta_{c_\alpha, y}} \tag{66}$$

for users, and

$$P(c_\alpha = y) \propto P_y \prod_{x=1}^{K_{\mathrm{user}}} P_{xy}^{\sum_i r_{i\alpha} \delta_{c_i, x}} (1 - P_{xy})^{\sum_i (1 - r_{i\alpha}) \delta_{c_i, x}} \tag{67}$$

for objects. These equations involve large powers of probability values and may lead
to inaccurate numerics during the computation. Instead of computing the probabilities
directly, one can first compute the exponents (i.e., work with logarithms of the probabilities)
and convert them to probabilities only during the update of latent classes. The values of
$N_x$, $N_y$ and $N_{xy}$, and thus of $P_x$, $P_y$ and $P_{xy}$, are updated
after each update of a user class or an object class. After a sufficient number of iterations, we can
start sampling estimated ratings of unobserved user-object pairs at regular time intervals
which should be long enough to ensure low correlation between consecutive sampled states.
For instance, the predicted rating for user $i$ on object $\alpha$ can be obtained by

$$\tilde r_{i\alpha} = \sum_t P_{xy}(t_c + tT)\, \delta_{x, c_i(t_c + tT)}\, \delta_{y, c_\alpha(t_c + tT)}, \tag{68}$$

where $t_c$ and $T$ are respectively the convergence time and the sampling time interval.
We can also store the state $(\{P_{xy}\}, \{c_i\}, \{c_\alpha\})$ at each sampling time and use the above
equation to obtain the predictions afterwards.
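The user update of Eq. (66) can be sketched as follows, using logarithms to avoid the numerical problems caused by large powers of probabilities. This is a minimal illustration under stated assumptions: the function name, the data layout (plain lists) and the toy numbers are hypothetical, and all probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
import math
import random


def sample_user_class(r_i, object_class, P_x, P_xy, rng):
    """Draw a new latent class for one user following Eq. (66), in log space.

    r_i          : 0/1 ratings of this user, one entry per object
    object_class : current class index of every object
    P_x          : P_x[x], probability that a random user is in class x
    P_xy         : P_xy[x][y], probability that a class-x user likes a class-y object
    """
    log_w = []
    for x in range(len(P_x)):
        lw = math.log(P_x[x])
        # The product over y in Eq. (66) equals a product over all objects.
        for r, y in zip(r_i, object_class):
            p = P_xy[x][y]
            lw += math.log(p) if r == 1 else math.log(1.0 - p)
        log_w.append(lw)
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    probs = [w / total for w in weights]
    new_class = rng.choices(range(len(P_x)), weights=probs)[0]
    return new_class, probs


# Toy example: three objects, two user classes, two object classes.
rng = random.Random(0)
r_i = [1, 1, 0]
object_class = [0, 0, 1]
P_x = [0.5, 0.5]
P_xy = [[0.9, 0.1], [0.1, 0.9]]
new_class, probs = sample_user_class(r_i, object_class, P_x, P_xy, rng)
print(new_class, probs)
```

The object update of Eq. (67) follows the same pattern with the roles of users and objects exchanged.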

We remark that the above two-sided Bayesian network corresponds to the simplest
dependency structure, which relates ratings merely to user and object classes. More
comprehensive Bayesian relations can be derived, for instance, to include the individual rating
preference for objects [184] or to model the possibility of mixed membership [185]. Another
class of extensions is to build a probabilistic relational model (PRM) [177] to predict
ratings by utilizing other meta-data including age, occupation and gender of users, or
category, price and origin of objects. Although this comprehensive information generally
improves recommendation results, determining a valid dependency structure of meta-data
is a non-trivial task in PRM.

5.3. Probabilistic Latent Semantic Analysis (pLSA) 

Probabilistic latent semantic analysis (pLSA) is similar to LSA in the sense that hidden
variables are introduced to explain the co-occurrence pairs of data. Unlike the algebraic
SVD employed in LSA, pLSA is a statistical technique based on a probabilistic model. Well
developed inference methods including likelihood maximization [180] and Gibbs sampling
[181] can thus be employed in pLSA. pLSA models the relations between users and objects
through the implicit overlap of genres, as compared to the two-sided Bayesian clustering
where each user and object belongs to a single specific category. In pLSA, the co-occurrence
probability $P(i, \alpha)$ of user $i$ and object $\alpha$ is expressed using the conditional probabilities given
a hidden variable $k$,

$$P(i, \alpha) = \sum_{k=1}^{K} P(i|k) P(\alpha|k) P(k). \tag{69}$$

Since $P(i|k)P(k) = P(k|i)P(i)$, this can be written as

$$P(i, \alpha) = P(i) \sum_{k=1}^{K} P(\alpha|k) P(k|i) \tag{70}$$

which leads to the conditional probability $P(\alpha|i)$ of an object $\alpha$ to be collected given the
user $i$,

$$P(\alpha|i) = \sum_{k=1}^{K} P(\alpha|k) P(k|i), \tag{71}$$

which is already a quantity useful for personalized recommendation. Unlike the Bayesian
clustering approach where the co-occurrences of users and objects are characterized by the
coupled probability $P_{xy}$ between the classes, users and objects are rendered independent
in pLSA given the hidden variables, i.e. the co-occurrence probabilities are factorized. Our
task is to obtain suitable forms of $P(\alpha|k)$ and $P(k|i)$ which provide accurate predictions
of collected and recommended objects through $P(\alpha|i)$. We note that instead of $P(\alpha|i)$,
$P(i, \alpha)$ can also be expressed as

$$P(i, \alpha) = P(\alpha) \sum_{k=1}^{K} P(i|k) P(k|\alpha) \tag{72}$$

to obtain $P(i|\alpha)$ which can be of interest for some purposes.

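To obtain $P(\alpha|k)$ and $P(k|i)$, one can adopt a variational approach described in [172]
to maximize the per-link log-likelihood of the observed dataset, which is given by

$$L(\phi, \theta) = \frac{1}{E} \sum_{(i,\alpha)} \log P(\alpha|i) = \frac{1}{E} \sum_{(i,\alpha)} \log\Big( \sum_{k=1}^{K} P(\alpha|k) P(k|i) \Big) \tag{73}$$

with respect to vectors $\phi$ and $\theta$ which parametrize $P(\alpha|k)$ and $P(k|i)$ by $P(\alpha|k) = \phi^{(k)}_\alpha$
and $P(k|i) = \theta^{(i)}_k$. Here $E$ is the total number of user-object links. We remark that the sum
over $(i, \alpha)$ includes only the observed user-object pairs. We then employ the expectation
maximization (EM) algorithm [180] to find the values of $\phi$ and $\theta$ that maximize $L(\phi, \theta)$. To
achieve this goal, one can introduce the variational probability distribution $Q(k|i, \alpha)$ in
Eq. (73) for each observed pair $(i, \alpha)$, with the constraint $\sum_{k=1}^{K} Q(k|i, \alpha) = 1$, which allows
us to rewrite $L(\phi, \theta)$ as

$$L(\phi, \theta) = \frac{1}{E} \sum_{(i,\alpha)} \log \sum_{k=1}^{K} Q(k|i, \alpha) \frac{P(\alpha|k) P(k|i)}{Q(k|i, \alpha)} \tag{74}$$

$$\ge \frac{1}{E} \sum_{(i,\alpha)} \sum_{k=1}^{K} Q(k|i, \alpha) \log \frac{P(\alpha|k) P(k|i)}{Q(k|i, \alpha)} := \mathcal{F}(Q, \phi, \theta) \tag{75}$$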

where the inequality is justified by Jensen's inequality. $\mathcal{F}(Q, \phi, \theta)$ can be written as

$$\mathcal{F}(Q, \phi, \theta) = \frac{1}{E} \sum_{(i,\alpha)} \Big\{ \sum_{k=1}^{K} Q(k|i, \alpha) \log[P(\alpha|k) P(k|i)] + S_{i\alpha}(Q) \Big\} \tag{76}$$

with $S_{i\alpha}(Q)$ being the entropy of the probability distribution $Q$ for the pair $(i, \alpha)$, which is

$$S_{i\alpha}(Q) = - \sum_{k=1}^{K} Q(k|i, \alpha) \log Q(k|i, \alpha). \tag{77}$$

Since $\mathcal{F}$ serves as a lower bound of the likelihood function $L(\phi, \theta)$, we maximize $\mathcal{F}$ with
respect to $Q$, $\phi$ and $\theta$. The expectation maximization algorithm thus maximizes $\mathcal{F}$ by finding
the optimal $Q$, $\phi$ and $\theta$ alternately. We first obtain the distribution $Q$ which maximizes $\mathcal{F}$
by assuming a particular form of $P(\alpha|k)$ and $P(k|i)$, holding $\phi$ and $\theta$ constant.
Maximization of $\mathcal{F}$ in this step is subject to the normalization of $Q(k|i, \alpha)$ for every observed
user-object pair, which leads us to the Lagrangian

$$\mathcal{L}(Q, \phi_t, \theta_t) = \mathcal{F}(Q, \phi_t, \theta_t) + \sum_{(i,\alpha)} \lambda_{i\alpha} \Big( \sum_{k=1}^{K} Q(k|i, \alpha) - 1 \Big) \tag{78}$$

where $\phi_t$ and $\theta_t$ denote respectively $\phi$ and $\theta$ after $t$ iteration steps, or equivalently,
$P_t(\alpha|k)$ and $P_t(k|i)$. $\mathcal{L}(Q, \phi_t, \theta_t)$ can then be differentiated to obtain the optimal $Q$ for
every observed user-object pair in the form

$$Q_t(k|i, \alpha) = \frac{P_t(\alpha|k) P_t(k|i)}{\sum_{k'=1}^{K} P_t(\alpha|k') P_t(k'|i)}. \tag{79}$$

This optimal $Q$ is obtained from $\theta_t$ and $\phi_t$, thus we label it as $Q_t$.



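We then proceed to obtain the distributions $P(\alpha|k)$ and $P(k|i)$, i.e. the values of $\phi^{(k)}_\alpha$
and $\theta^{(i)}_k$, by assuming $Q$ is held fixed at $Q = Q_t$ obtained from Eq. (79). Due to the
normalizations $\sum_{\alpha=1}^{M} P(\alpha|k) = 1$ for all $k$ and $\sum_{k=1}^{K} P(k|i) = 1$ for all $i$, the corresponding Lagrangian
has the form

$$\mathcal{L}(Q_t, \phi, \theta) = \mathcal{F}(Q_t, \phi, \theta) + \sum_{k} \lambda_k \Big( \sum_{\alpha=1}^{M} P(\alpha|k) - 1 \Big) + \sum_{i} \mu_i \Big( \sum_{k=1}^{K} P(k|i) - 1 \Big). \tag{80}$$

After differentiation, the optimal $P(\alpha|k)$ and $P(k|i)$ can be found:

$$P_{t+1}(\alpha|k) = \phi^{(k)}_\alpha(t+1) = \frac{\sum_{i} Q_t(k|i, \alpha)}{\sum_{(i,\alpha')} Q_t(k|i, \alpha')}, \tag{81}$$

$$P_{t+1}(k|i) = \theta^{(i)}_k(t+1) = \frac{\sum_{\alpha} Q_t(k|i, \alpha)}{\sum_{k'=1}^{K} \sum_{\alpha} Q_t(k'|i, \alpha)}, \tag{82}$$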

where the summation involving $i$ and $\alpha$ runs only over the observed user-object pairs. The
optimal $P(\alpha|k)$ and $P(k|i)$ are obtained from $Q = Q_t$. We label them as $P_{t+1}(\alpha|k)$ and
$P_{t+1}(k|i)$, respectively, because they constitute the basis for the next iteration step where
$Q_{t+1}$ is found. After stationary values of $Q_t$, $\phi_t$ and $\theta_t$ are found, Eq. (71) is used to obtain
personalized recommendations.
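The alternating procedure described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the random initialization and the fixed number of iterations are assumptions, and the M-step simply normalizes the accumulated values of $Q_t$ as the Lagrangian conditions require.

```python
import random


def plsa_em(pairs, n_users, n_objects, K, n_iter=50, seed=0):
    """EM for pLSA on observed (user, object) pairs; returns (phi, theta)."""
    rng = random.Random(seed)

    def normalized(n):
        v = [rng.random() + 0.1 for _ in range(n)]
        s = sum(v)
        return [x / s for x in v]

    phi = [normalized(n_objects) for _ in range(K)]   # phi[k][a] = P(a|k)
    theta = [normalized(K) for _ in range(n_users)]   # theta[i][k] = P(k|i)

    for _ in range(n_iter):
        acc_phi = [[0.0] * n_objects for _ in range(K)]
        acc_theta = [[0.0] * K for _ in range(n_users)]
        for i, a in pairs:
            # E-step, Eq. (79): optimal Q for this observed pair.
            q = [phi[k][a] * theta[i][k] for k in range(K)]
            z = sum(q)
            for k in range(K):
                acc_phi[k][a] += q[k] / z
                acc_theta[i][k] += q[k] / z
        # M-step: normalize the accumulated Q values.
        for k in range(K):
            s = sum(acc_phi[k])
            if s > 0:
                phi[k] = [x / s for x in acc_phi[k]]
        for i in range(n_users):
            s = sum(acc_theta[i])
            if s > 0:  # users without observed pairs keep their initial theta
                theta[i] = [x / s for x in acc_theta[i]]
    return phi, theta


def p_object_given_user(phi, theta, i, a):
    """Eq. (71): P(a|i) = sum_k P(a|k) P(k|i)."""
    return sum(phi[k][a] * theta[i][k] for k in range(len(phi)))


# Toy data: users 0-1 prefer objects 0-1, users 2-3 prefer objects 2-3.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2)]
phi, theta = plsa_em(pairs, n_users=4, n_objects=4, K=2)
row = [p_object_given_user(phi, theta, 3, a) for a in range(4)]
print(row)
```

In practice one would stop when the lower bound no longer increases instead of running a fixed number of iterations.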

As the above pLSA model considers a multinomial distribution $\phi^{(k)}_\alpha$ which can be used
to model only binary preferences, one may consider the generalized pLSA [186] which allows
for numeric ratings. The fast increasing number of independent variables used by pLSA
(there are $K(N + M)$ of them) and the cold-start problem for new objects can be alleviated
by assuming prior distributions on $\phi^{(k)}_\alpha$ and $\theta^{(i)}_k$, such as the Dirichlet priors discussed in the
following section [173].

5.4. Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) [173] is similar to pLSA in the sense that hidden
variables are present in a probabilistic way. While pLSA does not assume a specific prior
distribution over $P(k|i)$, LDA assumes priors that have the form of the Dirichlet
distribution. LDA was applied to predict review scores based on the content of reviews
[187] and to uncover implicit community structures in a social network [188]. It can
also be extended to include general meta-data of users and objects [189].

For each user $i$ there is a distribution $P(k|\theta^{(i)})$ where $\theta^{(i)}$ is a $K$-dimensional multinomial
distribution with $\theta^{(i)}_k$ being the probability of user $i$ belonging to the latent class $k$, i.e. $P(k|i) =
\theta^{(i)}_k$. Unlike pLSA, the variable $\theta^{(i)}$ in LDA has a Dirichlet prior distribution with a
$K$-dimensional parameter $a$. The probability of observing user $i$ with the collected object set
$\{\alpha\}_i$ is

$$P(\{\alpha\}_i|a, \phi) = \int d\theta^{(i)} P(\theta^{(i)}|a) \prod_{\mu=1}^{k_i} \sum_{k=1}^{K} P(\alpha^{(i)}_\mu|k) P(k|\theta^{(i)}) \tag{83}$$

where $k_i$ is the number of objects collected by user $i$ and $\alpha^{(i)}_\mu$ is the $\mu$-th object collected
by $i$. The Dirichlet prior distribution $P(\theta^{(i)}|a)$ is

$$P(\theta^{(i)}|a) = \frac{\Gamma\big(\sum_{k=1}^{K} a_k\big)}{\prod_{k=1}^{K} \Gamma(a_k)} \prod_{k=1}^{K} \big(\theta^{(i)}_k\big)^{a_k - 1} \tag{84}$$

where $\Gamma$ is the Gamma function, the constraint $\sum_{k=1}^{K} \theta^{(i)}_k = 1$ holds and $\theta^{(i)}_k > 0$ for all
$i$ and $k$. Prior distributions for all $\theta^{(i)}$ share the same parameter $a$. The LDA model can
be represented by a so-called plate notation which is shown in Fig. 8. The probability of the
observed data $\{(i, \alpha)\}$,

$$P(\{(i, \alpha)\}|a, \phi) = \prod_{i} \int d\theta^{(i)} P(\theta^{(i)}|a) \prod_{\mu=1}^{k_i} \sum_{k=1}^{K} P(\alpha^{(i)}_\mu|k) P(k|\theta^{(i)}), \tag{85}$$

depends on the parameter vectors $a$ and $b$.

Figure 8: A so-called plate notation representing the LDA recommendation model: $a$ and $b$ represent the
model's parameters at the system level, which characterize respectively the Dirichlet distribution of $\theta^{(i)}$
and $\phi^{(k)}$, where $P(k|i) = \theta^{(i)}_k$ and $P(\alpha|k) = \phi^{(k)}_\alpha$.

In the original formulation of LDA [173], $P(\alpha|k)$ is given by a multinomial distribution
parametrized by $\phi$ such that $P(\alpha|k) = \phi^{(k)}_\alpha$ as in the case of pLSA. Some variants of LDA
consider $P(\alpha|k)$ following a Dirichlet prior distribution, which is known as a smoothed
version of LDA [190]. In this case, $b$ which characterizes the prior distribution of $P(\alpha|k) = \phi^{(k)}_\alpha$
in Eq. (83) corresponds to an $M$-dimensional parameter of the Dirichlet prior distribution

$$P(\phi^{(k)}|b) = \frac{\Gamma\big(\sum_{\alpha=1}^{M} b_\alpha\big)}{\prod_{\alpha=1}^{M} \Gamma(b_\alpha)} \prod_{\alpha=1}^{M} \big(\phi^{(k)}_\alpha\big)^{b_\alpha - 1}. \tag{86}$$

Prior distributions for all $\phi^{(k)}$ share the same parameter $b$.

In order to obtain rating predictions for unobserved user-object pairs, one has to find
$P(\{\phi^{(k)}\}, \{\theta^{(i)}\}|\{(i, \alpha)\}, a, b)$ and use $\{\phi^{(k)}\}$ and $\{\theta^{(i)}\}$ to make personalized predictions
for user $i$. For instance, the predicted score for an unobserved pair $(i, \alpha)$ is given by
$\tilde r_{i\alpha} = \sum_{k=1}^{K} \theta^{(i)}_k \phi^{(k)}_\alpha$. Since the distribution of $\{\phi^{(k)}\}, \{\theta^{(i)}\}$ is in general intractable, one can
follow [173] and adopt a variational approach to maximize the likelihood, similarly to what we did
in the case of pLSA. Here we describe Gibbs sampling for the smoothed LDA as an
alternative inference method [182]. The procedure is similar to Gibbs sampling in
Bayesian networks, except that one assigns a latent class to each user-object pair, instead
of classes to individual users and objects.

To derive an equation for Gibbs sampling in LDA, we first assign an index $\mu$ to each
observed user-object pair $(i, \alpha)$ so that $k_\mu$ is a latent variable drawn from the multinomial
distribution $\theta^{(i_\mu)}$ and $\alpha_\mu$ is an object drawn from the multinomial distribution $\phi^{(k_\mu)}$. As
shown in [182, 191], the inference method is much simplified by assuming a symmetric
Dirichlet prior with homogeneous $a$ and $b$, i.e. $a_1 = \dots = a_K := a$ and $b_1 = \dots = b_M := b$.
Then one can show that the conditional probability for an observed user-object pair $\mu'$
characterized by $k_{\mu'}$ is given by

$$P(k_{\mu'} = k|\{k_\mu\}_{-\mu'}, \{(i_\mu, \alpha_\mu)\}) \propto P(\alpha_{\mu'}|k_{\mu'} = k, \{k_\mu\}_{-\mu'}, \{(i_\mu, \alpha_\mu)\}_{-\mu'})\, P(k_{\mu'} = k|\{k_\mu\}_{-\mu'})$$

$$= \frac{n^{(k)}_{-\mu', \alpha_{\mu'}} + b}{n^{(k)}_{-\mu'} + Mb} \cdot \frac{n^{(i_{\mu'})}_{-\mu', k} + a}{n^{(i_{\mu'})}_{-\mu'} + Ka} \tag{87}$$

where all $n_{-\mu'}$'s are evaluated in the absence of pair $\mu'$: $n^{(k)}_{-\mu'}$ is the number of observed
pairs characterized by latent class $k$, $n^{(k)}_{-\mu', \alpha}$ is the number of observed pairs of object $\alpha$
characterized by latent class $k$, $n^{(i)}_{-\mu'}$ is the number of observed pairs of user $i$ (degree of $i$)
and $n^{(i)}_{-\mu', k}$ is the number of observed pairs with user $i$ characterized by latent class $k$.

The Gibbs sampling process runs as follows. We first start with a random assignment
of a latent class to each observed user-object pair and successively pick a random user-object
pair to update its latent class according to Eq. (87). This corresponds to a shift of the
system state from one to another; $n^{(k)}$, $n^{(k)}_\alpha$, $n^{(i)}$ and $n^{(i)}_k$ are updated after each new
assignment of the latent class. After a sufficient number of iterations, one can sample $\phi^{(k)}_\alpha$
and $\theta^{(i)}_k$ at a regular time interval,

$$\phi^{(k)}_\alpha = \frac{n^{(k)}_\alpha + b}{n^{(k)} + Mb}, \qquad \theta^{(i)}_k = \frac{n^{(i)}_k + a}{n^{(i)} + Ka}. \tag{88}$$

With these samples of $\phi^{(k)}_\alpha$ and $\theta^{(i)}_k$, the predicted score for an unobserved user-object pair
can be computed as

$$\tilde r_{i\alpha} = \sum_{t} \sum_{k=1}^{K} \theta^{(i)}_k(t_c + tT)\, \phi^{(k)}_\alpha(t_c + tT) \tag{89}$$

where $t_c$ and $T$ are respectively the convergence time and the sampling time interval. As
in the Gibbs sampling of Bayesian clustering, states of $\phi^{(k)}$ and $\theta^{(i)}$ can be stored and used
to compute the predicted scores later. When the input data is large, one can distribute
the Gibbs sampling over several processors to shorten the computation time [192].
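The sampling process of Eqs. (87) and (88) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the default hyperparameters and the number of sweeps are hypothetical, and only a single sample of $\phi$ and $\theta$ is returned at the end instead of the average over sampling times used in Eq. (89).

```python
import random


def lda_gibbs(pairs, n_users, n_objects, K, a=0.5, b=0.1, sweeps=200, seed=0):
    """Gibbs sampling for smoothed LDA on observed (user, object) pairs."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in pairs]           # latent class of each pair
    n_k = [0] * K                                    # pairs with class k
    n_ka = [[0] * n_objects for _ in range(K)]       # pairs of object al with class k
    n_ik = [[0] * K for _ in range(n_users)]         # pairs of user i with class k
    for (i, al), k in zip(pairs, z):
        n_k[k] += 1
        n_ka[k][al] += 1
        n_ik[i][k] += 1

    for _ in range(sweeps * len(pairs)):
        m = rng.randrange(len(pairs))
        i, al = pairs[m]
        k_old = z[m]
        # Evaluate all counts in the absence of the chosen pair.
        n_k[k_old] -= 1
        n_ka[k_old][al] -= 1
        n_ik[i][k_old] -= 1
        # Eq. (87); the user-side denominator is independent of k and dropped.
        w = [(n_ka[k][al] + b) / (n_k[k] + n_objects * b) * (n_ik[i][k] + a)
             for k in range(K)]
        k_new = rng.choices(range(K), weights=w)[0]
        z[m] = k_new
        n_k[k_new] += 1
        n_ka[k_new][al] += 1
        n_ik[i][k_new] += 1

    # Eq. (88): point estimates from the current state of the sampler.
    phi = [[(n_ka[k][al] + b) / (n_k[k] + n_objects * b)
            for al in range(n_objects)] for k in range(K)]
    degree = [sum(row) for row in n_ik]
    theta = [[(n_ik[i][k] + a) / (degree[i] + K * a) for k in range(K)]
             for i in range(n_users)]
    return phi, theta


pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2)]
phi, theta = lda_gibbs(pairs, n_users=4, n_objects=4, K=2)
scores_user3 = [sum(theta[3][k] * phi[k][al] for k in range(2)) for al in range(4)]
print(scores_user3)
```

Averaging the scores over several well-separated samples, as in Eq. (89), reduces the noise of a single sample.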
6. Diffusion-based methods



Just as the classical PageRank algorithm [152] brought order to the Internet
by analyzing the directed network of links among web pages, one could aim to obtain
recommendations using a network representation of the input data with user preferences.
The algorithms presented in this section are all based on specific transformations
(projections) of the input data to object-object networks. Personalized recommendations for an
individual user are then obtained by using this user's past preferences as "sources" in a
given network and propagating them to yet unevaluated objects.

6.1. Heat diffusion algorithm (HDiff) 

This algorithm is based on projecting the input data on a simple object network
characterized by a symmetric adjacency matrix $A$ with elements either one (for similar objects)
or zero (for dissimilar objects). It recommends objects to an individual user by a process
motivated by heat diffusion: objects liked and disliked by this user are represented as hot
and cold spots respectively, and recommendation is made according to the equilibrium
"temperature" of the nodes in the network [193]. The discrete Laplace operator of the
network has the form $L = \mathbf{1} - D^{-1}A$, where $\mathbf{1}$ is the identity matrix and $D$ is the network's
diagonal degree matrix with elements $D_{\alpha\beta} = k_\alpha \delta_{\alpha\beta}$. This operator is a discrete analog of the heat diffusion operator
$-\nabla^2$ which is well known in physics. The resulting temperature vector for user $i$, $h_i$, is
the solution of the heat diffusion equation

$$L h_i = f_i \tag{90}$$

and has both a variable part (which we seek) and a fixed part. Fixed elements of $h_i$ correspond
to objects already evaluated by user $i$; they are set to 1 (objects liked by the user, which act
as heat sources) or 0 (objects disliked by the user, which act as heat sinks). Mathematically
this corresponds to the Dirichlet boundary condition. The external flux vector $f_i$ is
non-zero only for objects evaluated by user $i$ and allows for fixed values attributed to sources
and sinks. Eq. (90) can be solved using the Green's function method and the involved
computational cost can be lowered by utilizing various algebraic properties of $L$ [193]. At
the same time, it is straightforward to find the equilibrium $h_i$ iteratively by setting the
initial temperature vector $h_i^{(0)}$ to contain only the fixed heat sources and sinks and iterating

$$h_i^{(n+1)} = L'_i h_i^{(n)} \tag{91}$$

where $L'_i$ is the same as the Laplace operator above except that it keeps elements in $h_i$
corresponding to $i$'s evaluated objects unchanged.
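The iterative search for the equilibrium temperatures can be sketched as follows. The sketch assumes that each iteration replaces the temperature of every free node by the average temperature of its neighbors (i.e. it applies $D^{-1}A$ to the free nodes), which is the standard relaxation whose fixed point satisfies $(\mathbf{1} - D^{-1}A)h = 0$ on the free nodes with the evaluated objects held fixed. The function name and the toy network are hypothetical.

```python
def heat_diffusion(adj, fixed, n_iter=200):
    """Relax temperatures on an object network with fixed sources and sinks.

    adj   : symmetric 0/1 adjacency matrix as a list of lists
    fixed : dict mapping evaluated objects to their fixed temperature
    """
    n = len(adj)
    h = [fixed.get(node, 0.0) for node in range(n)]
    for _ in range(n_iter):
        new_h = h[:]
        for node in range(n):
            if node in fixed:
                continue  # Dirichlet boundary condition: keep evaluated objects
            nbrs = [other for other in range(n) if adj[node][other]]
            if nbrs:
                new_h[node] = sum(h[other] for other in nbrs) / len(nbrs)
        h = new_h
    return h


# Path network 0 - 1 - 2 - 3: object 0 is liked (hot), object 3 disliked (cold).
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
h = heat_diffusion(adj, fixed={0: 1.0, 3: 0.0})
print(h)
```

On this path the equilibrium temperature decreases linearly from the source to the sink, so object 1 would be recommended before object 2.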

Given this mathematical framework, it is still an open question how exactly to apply it
to a given rating matrix $R$. The procedure adopted in [193] is that Pearson's correlation
coefficient for ratings of objects $\alpha$ and $\beta$, $C_{\alpha\beta}$, is compared with a specified threshold $C_t$;
$A_{\alpha\beta} = 1$ when $C_{\alpha\beta} > C_t$ and it is zero otherwise. The threshold $C_t$ is set so that the
resulting number of links is the same as the number of non-zero entries in $R^T R$ (which is
equivalent to the number of object pairs co-evaluated by at least one user). The boundary
condition for user $i$ is composed of the ratings given by this user to different objects (note
that the heat diffusion equations above are not constrained to the binary like/dislike
(hot/cold) case and can be applied to arbitrary-valued ratings).

Figure 9: Graphical representation of the links created by a user who has rated only objects 1 (rating 5),
2 (rating 3) and 3 (rating 4).

6.2. Multilevel spreading algorithm (MultiS) 

This algorithm can be applied when ratings $r_{i\alpha}$ are given on a discrete scale (see footnote 7). For
example, Amazon.com employs a five-star scale where one and five stars correspond to
the worst and best rating, respectively. For the sake of simplicity, we assume a five-level
rating scale in the rest of this subsection (generalization to a different number of levels
is straightforward). As in other diffusion-based methods, the recommendation process
starts with the preparation of a particular object-object projection of the rating data. To
eliminate the loss of information in the projection, instead of merely creating a link between
two objects, in this multilevel spreading algorithm links are created between the ratings given
to a pair of objects [194]. As a result we obtain $5^2 = 25$ separate connections (channels) for
each object pair. This is illustrated in Fig. 9 on an example of a user who has rated three
movies; as a result, three links are created between the given movies. When all data are
processed, contributions from all users accumulate and a weighted object-object network
is created. Note that splitting the connection between two objects into multiple separate
channels aggravates the data sparsity problem and can lead to inferior performance of this
algorithm in some cases.

7 It is also possible to apply a binning procedure to continuous-valued ratings and hence transform them
into an integer scale but that has not been used in practice yet.

Between a given pair of objects we create multiple links which are conveniently stored
in a $5 \times 5$ matrix. By representing integer ratings $r_{i\alpha}$ with column vectors in 5-dimensional
space (unknown rating with $v_{i\alpha} = (0, 0, 0, 0, 0)^T$, rating $r_{i\alpha} = 1$ with $v_{i\alpha} = (1, 0, 0, 0, 0)^T$,
rating $r_{i\alpha} = 2$ with $v_{i\alpha} = (0, 1, 0, 0, 0)^T$, etc.), the connection matrix for objects $\alpha$ and $\beta$ has
the form

$$W_{\alpha\beta} = \sum_{i=1}^{M} \frac{v_{i\alpha} v_{i\beta}^T}{k_i - 1}. \tag{92}$$

Weights of individual users are inversely proportional to their number of evaluations. This
aims to compensate for the quadratically growing number of links created by an individual
user, $k_i(k_i - 1)/2$; hence in total we get a linear relation between a user's number of
evaluations and the cumulative weight of the user's ratings (see footnote 8).
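The construction of the connection matrices can be sketched as follows, assuming the user weight $1/(k_i - 1)$ mentioned in footnote 8. The function name and the dictionary-based data layout are hypothetical; the single user in the example is the one shown in Figure 9.

```python
def connection_matrices(ratings, levels=5):
    """Accumulate the levels x levels connection matrix of every object pair.

    ratings : list with one dict per user, mapping object -> integer rating
              between 1 and `levels`
    Returns a dict mapping (alpha, beta) to a levels x levels list of lists.
    """
    W = {}
    for user in ratings:
        k = len(user)
        if k < 2:
            continue  # a user with a single rating creates no links
        weight = 1.0 / (k - 1)
        for alpha in user:
            for beta in user:
                if alpha == beta:
                    continue
                block = W.setdefault(
                    (alpha, beta), [[0.0] * levels for _ in range(levels)])
                # Outer product of the two unit rating vectors, times the weight.
                block[user[alpha] - 1][user[beta] - 1] += weight
    return W


# The user of Figure 9: object 1 rated 5, object 2 rated 3, object 3 rated 4.
W = connection_matrices([{1: 5, 2: 3, 3: 4}])
print(W[(1, 2)][4][2])  # channel "5 for object 1" -> "3 for object 2"
```

Storing only the non-empty object pairs in a dictionary keeps the memory cost proportional to the number of co-rated pairs.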

Matrices $W_{\alpha\beta}$ form a symmetric matrix $W$ with dimensions $5N \times 5N$. By the column
normalization of $W$ we obtain an asymmetric matrix $\Omega$ which describes a diffusion process
on the underlying network with the outgoing weights from any node in the graph normalized
to unity (if one chooses the row normalization instead, the resulting process is equivalent to
heat conduction in the network; for a mathematically-oriented review of flows in networks
see [195]). Elements with large weights in $\Omega$ represent strong patterns in user ratings
(e.g., most of those who rated movie X with 5 gave 3 to movie Y). Similarly to other
diffusion-based methods, we obtain personal recommendations for user $i$ by combining the
aggregate matrix $\Omega$ with opinions already expressed by $i$. These opinions are stored in a
$5N$-dimensional vector $h_i$ (the first 5 elements correspond to object 1, the next 5 elements
to object 2, etc.). As in Sec. 6.1, we seek the stationary solution of the equation

$$\Omega_i h_i = h_i, \tag{93}$$

where $\Omega_i$ is the same as $\Omega$ except that it keeps the elements corresponding to the objects
evaluated by user $i$ unchanged. Resulting vectors $h_i^{(n)}$ contain information about objects
unrated by user $i$. This information can be used to obtain rating predictions by the
standard weighted average. For example, if for a given object in $h_i$ we obtain the
5-tuple $(0.0, 0.2, 0.4, 0.4, 0.0)^T$, the rating prediction is $0.2 \times 2 + 0.4 \times 3 + 0.4 \times 4 = 3.2$.
Numeric tests presented in [194] suggest that $h_i^{(1)}$ is a good enough predictor. Sophisticated
techniques to avoid multiple iterations in Eq. (93) [194] are hence not fundamental for
practical applications of this algorithm. An alternative way is to map each object to several
channels with the number of channels being equal to the number of different ratings, so
that if a user $i$ has collected an object $\alpha$ with rating 2, she will only connect to $\alpha^{(2)}$. After
that, one can directly apply the probabilistic spreading process (see the next subsection)
to obtain the similarity and then integrate it into the collaborative filtering framework to
obtain better recommendations [196].
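The weighted-average step can be written out explicitly. In the short sketch below, the function name is hypothetical and the division by the total weight is an added assumption that makes the function usable even when the 5-tuple is not normalized to one.

```python
def expected_rating(channel_weights):
    """Weighted average of rating levels 1..n given the weight of each level."""
    total = sum(channel_weights)
    if total == 0:
        return None  # no information about this object reached the user
    return sum((level + 1) * w for level, w in enumerate(channel_weights)) / total


prediction = expected_rating([0.0, 0.2, 0.4, 0.4, 0.0])
print(round(prediction, 6))
```

For the 5-tuple used in the text this reproduces the predicted rating of 3.2 up to floating-point rounding.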

6.3. Probabilistic spreading algorithm (ProbS) 

This algorithm is suitable for data without explicit ratings, i.e. only the sets of objects
collected/visited by each user are known. Elements of the rating matrix $R$ are hence $r_{i\alpha} = 1$
(when user $i$ has collected/visited object $\alpha$) or $r_{i\alpha} = 0$ (otherwise). More explicit preference
indicators can be easily mapped to this form, albeit losing information in the process,
whereas the converse is not so.

8 Since the users who have evaluated only one object add no links to the object-object network, the
divergence of the weight $1/(k_i - 1)$ at $k_i = 1$ is not an obstacle.

Figure 10: Illustration of the ProbS's resource-allocation process in a simple bipartite network. The
assigned resources first flow from object-nodes (circles) to user-nodes (squares) and then return back to
object-nodes.

The spreading recommendation algorithm proposed in [197] is based on a projection
of the input data (which can be represented by an unweighted user-object network) to
an object-object network. In this projection, the weight $W_{\alpha\beta}$ can be considered as the
importance of node $\alpha$ with respect to node $\beta$ and in general it differs from $W_{\beta\alpha}$. A suitable
form of $W_{\alpha\beta}$ can be obtained by studying the original bipartite network where a certain
amount of a resource (a scalar quantity which reflects, for example, social influence in a
recommender system) is assigned to each object node. Since the network is unweighted,
the initial resource of each object is split equally among all its neighboring
user-nodes (an unbiased allocation). Subsequently, resources collected by user-nodes are equally redistributed back
to their neighboring object-nodes. This is equivalent to a random walk from the initial
source nodes to a distance of two in the user-object bipartite graph. An illustration of this
resource-allocation process for a simple bipartite network is shown in Fig. 10.

Denoting the initial object resource values as $x_\alpha$, the two resource-distribution steps
can be merged into one and the final resource values read $x'_\alpha = \sum_{\beta=1}^{N} W^P_{\alpha\beta} x_\beta$ where

$$W^P_{\alpha\beta} = \frac{1}{k_\beta} \sum_{i=1}^{M} \frac{r_{i\alpha} r_{i\beta}}{k_i}. \tag{94}$$

The superscript P stands for "probabilistic" and serves to distinguish the current spreading
process from its modifications that we shall discuss later. Note that the resulting $N \times N$
transition matrix is column normalized with $W^P_{\alpha\beta}$ representing the fraction of the initial
$\beta$'s resource transferred to $\alpha$. Recommendations for a given user $i$ are obtained by setting
the initial resource vector $h^i$ in accordance with the objects the user has already collected,
that is, by setting $h^i_\beta = r_{i\beta}$. Recommendation scores of objects are then obtained by

$$\tilde h^i_\alpha = \sum_{\beta=1}^{N} W^P_{\alpha\beta} h^i_\beta.$$

Objects recommended to user $i$ are then selected according to $\tilde h^i_\alpha$ (the higher the value,
the better).
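The two resource-distribution steps can be sketched directly on the bipartite network without building the transition matrix. The function name, the set-based data layout and the toy data are hypothetical; the sketch assumes unit initial resource on every object collected by the target user.

```python
def probs_scores(collected, n_objects, target_user):
    """ProbS scores for one user via two-step resource spreading.

    collected : list of sets; collected[i] holds the objects collected by user i
    """
    k_obj = [0] * n_objects
    for objs in collected:
        for obj in objs:
            k_obj[obj] += 1
    # Step 1: every collected object splits its unit resource among its users.
    user_res = [0.0] * len(collected)
    for obj in collected[target_user]:
        for i, objs in enumerate(collected):
            if obj in objs:
                user_res[i] += 1.0 / k_obj[obj]
    # Step 2: every user splits the received resource among their objects.
    scores = [0.0] * n_objects
    for i, objs in enumerate(collected):
        for obj in objs:
            scores[obj] += user_res[i] / len(objs)
    return scores


collected = [{0, 1}, {1, 2}, {0, 2, 3}]
scores = probs_scores(collected, n_objects=4, target_user=0)
# Rank only the objects that the target user has not collected yet.
ranking = sorted((o for o in range(4) if o not in collected[0]),
                 key=lambda o: -scores[o])
print(scores, ranking)
```

The total amount of resource is conserved by both steps, which is a convenient sanity check for an implementation.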

The original ProbS algorithm was later improved in various directions. In [110], the
authors suggested a heterogeneous distribution of the initial resources among the nodes,
$h^i_\alpha = a_{i\alpha} k_\alpha^\theta$ (where $a_{i\alpha} = 1$ if user $i$ has collected object $\alpha$ and 0 otherwise), and showed
that when the parameter $\theta$ is tuned appropriately, it can help
increase the accuracy of recommendations and it also makes the recommendations more
personalized. The optimal value of $\theta$ is close to $-1$ (e.g., $-0.8$ to $-1.0$ for MovieLens,
depending on the size of the selected data set [110, 197]), indicating that each item should be
assigned more or less the same amount of total initial resource. In [111], it was proposed
to construct the transition matrix as $W + \eta W^2$ where $W$ is defined by Eq. (94) and $\eta$ is
a free parameter. By effectively removing redundant correlations (the optimal value of
$\eta$ is usually negative), this method succeeded in outperforming ProbS and other derived
methods in terms of accuracy and diversity of recommendations. A similar method can also
be applied in designing a more accurate similarity index for collaborative filtering [198] and
link prediction [148]. In [199], the authors proposed to increase the method's accuracy by giving
preference to objects with degree similar to the average degree of objects collected by
a given user. In addition, the degree correlation [200], users' tastes [201] and user behavior
patterns [52] can also be taken into account to improve the recommendation accuracy.

Finally, we introduce a preferential diffusion (PD) method, which was proposed to enhance
the algorithm's ability to find unpopular and niche objects [113]. The basic idea is that at
the last step (i.e., diffusion from users to objects), the amount of resource that object $\alpha$
receives is proportional to $k_\alpha^\varepsilon$ where $\varepsilon \le 0$ is a free parameter. When $\varepsilon = 0$, this method
is identical to ProbS. It was shown that PD not only provides more accurate
recommendations than ProbS but it also generates more diverse and novel recommendations by
recommending relevant unpopular items. The authors further compared the intra-similarity
of recommended items with that of the whole system. As shown in Fig. 11, they draw a
line that divides the parameter space into two phases: in the left region, especially the
area corresponding to smaller $\varepsilon$ and larger $L$, PD is like a concave lens that broadens the
user's vision, while in the right region, corresponding to larger $\varepsilon$ and smaller $L$, PD is like
a convex lens that narrows the user's vision. Of course, we prefer the former case since it
embodies the merit of personalization.

Note that the resource-allocation process can also be applied in unipartite networks.
Considering a pair of nodes $i$ and $j$, $i$ can send some resource to $j$ with their common
neighbors playing the role of transmitters. In the simplest case, we assume that each
transmitter has a unit of resource, and will equally distribute it to all its neighbors. This
defines a similarity index (called resource allocation index [100]) between nodes:

$$s_{ij} = \sum_{z \in \Gamma_i \cap \Gamma_j} \frac{1}{k_z}$$

where $\Gamma_i$ denotes the set of $i$'s neighboring nodes. Recent works showed that despite its
simplicity, this index performs better than many known local similarity indices in link
prediction [100], community detection [202], and the characterization of weighted
transportation networks.
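The resource allocation index is simple to compute from neighbor sets. In the sketch below, the function name and the toy network are hypothetical.

```python
def resource_allocation_index(neighbors, i, j):
    """Sum of 1/k_z over the common neighbors z of nodes i and j."""
    common = neighbors[i] & neighbors[j]
    return sum(1.0 / len(neighbors[z]) for z in common)


# Toy network with links 0-2, 1-2, 0-3, 1-3 and 3-4.
neighbors = {
    0: {2, 3},
    1: {2, 3},
    2: {0, 1},
    3: {0, 1, 4},
    4: {3},
}
s_01 = resource_allocation_index(neighbors, 0, 1)
print(s_01)
```

Nodes 0 and 1 share two common neighbors: node 2 with degree 2 contributes 1/2 and node 3 with degree 3 contributes 1/3.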



6.4. Hybrid spreading-relevant algorithms

To answer the need for diversity in algorithm-based recommendation (see Section 2.2
for a discussion of this problem and possible solutions), a hybrid algorithm was proposed
in [37] which combines accuracy-focused ProbS with diversity-favoring heat spreading. As
in probabilistic spreading, heat spreading works by assigning objects an initial level of
"resource" denoted by the vector $h$, and then redistributing it via the transformation
$\tilde h = W^H h$. The transition matrix of heat spreading reads

$$W^H_{\alpha\beta} = \frac{1}{k_\alpha} \sum_{i=1}^{M} \frac{r_{i\alpha} r_{i\beta}}{k_i}$$

which, in contrast to $W^P$ obtained with Eq. (94), is row-normalized and corresponds to a
heat diffusion process (thus named HeatS) on the given user-object network.

Figure 12: (Color online) Comparison of ProbS and HeatS, where the target user is marked by a star and
the collected objects are of initial resource 1. The final scores after ProbS and HeatS are listed in the right
sides of plots (c) and (e).



Figure 12 illustrates the procedures of ProbS and HeatS. According to the final scores,
ProbS will recommend the third object to the target user, while HeatS will recommend the
second. Generally speaking, HeatS is able to find unpopular (i.e., low-degree) objects,
whereas ProbS tends to recommend popular objects and thus lacks diversity and novelty. On
the other hand, recommendations obtained by HeatS are too peculiar to be useful (see footnote 9). To
integrate the advantages of both algorithms, the authors of [37] proposed an elegant hybrid
of $W^H$ and $W^P$ (named HybridS) in the form

$$W^{H+P}_{\alpha\beta} = \frac{1}{k_\alpha^{1-\lambda} k_\beta^{\lambda}} \sum_{i=1}^{M} \frac{r_{i\alpha} r_{i\beta}}{k_i} \tag{97}$$



where A = gives pure heat spreading and A = 1 gives pure probabilistic spreading. As be- 
fore, the resulting recommendation scores for user i are computed as h % a = YjB=i ^a^ P ^ 
where the initial resource values are set as fig = r^. Results shown in [37] show that 
this combination of two different algorithms allows us not merely to compromise between 
diversity and accuracy but to simultaneously improve both aspects. By tuning the degree 



9 Compared with the similarity-based methods and ProbS, the AUC value and precision of HeatS arc 
considerably lower. Therefore, using HeatS alone seems not proper. Recent works |2041 1205] indicate that 
some weighted version of HeatS could also give highly accurate recommendation. 
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of hybridization, represented by the parameter A, the algorithms can be tailored to many 
custom situations and requirements. Note that, the parameter A is not necessarily to be 
the same for different users and objects, namely each user % can have her own parameter \ 
or each object a can have its own parameter X a , in this way, the algorithmic performance 
can be further improved [206] . And the initial resource distribution is not necessarily to 
be homogeneous, and by introducing the heterogeneity, the algorithmic performance can 
be improved [207J. 
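The hybrid transition matrix can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the user-object matrix `A` is hypothetical toy data, the function names are ours, and every user and object is assumed to have degree at least one so that no division by zero occurs.

```python
def hybrid_matrix(A, lam):
    """Object-object matrix W[a][b] = overlap(a, b) / (k_a**(1-lam) * k_b**lam).

    A[j][a] = 1 if user j has collected object a, else 0.
    lam = 0 gives heat spreading (HeatS), lam = 1 gives probabilistic spreading (ProbS).
    """
    n_users, n_objects = len(A), len(A[0])
    k_user = [sum(row) for row in A]
    k_obj = [sum(A[j][a] for j in range(n_users)) for a in range(n_objects)]
    W = [[0.0] * n_objects for _ in range(n_objects)]
    for a in range(n_objects):
        for b in range(n_objects):
            # sum_j a_ja * a_jb / k_j
            overlap = sum(A[j][a] * A[j][b] / k_user[j] for j in range(n_users))
            W[a][b] = overlap / (k_obj[a] ** (1.0 - lam) * k_obj[b] ** lam)
    return W


def hybrid_scores(A, user, lam):
    """Scores for one user: W applied to the user's own row of A."""
    W = hybrid_matrix(A, lam)
    h = A[user]
    return [sum(W[a][b] * h[b] for b in range(len(h))) for a in range(len(h))]


# Hypothetical data: 3 users (rows) and 4 objects (columns).
A = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
]
heat = hybrid_matrix(A, 0.0)   # row-normalized
prob = hybrid_matrix(A, 1.0)   # column-normalized
scores = hybrid_scores(A, user=0, lam=0.5)
print(scores)
```

In practice the objects already collected by the user would be removed from the ranking before recommending, and $\lambda$ would be tuned on held-out data.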

Similar to the hybrid spreading algorithm, B-Rank combines a random walk process
with heat diffusion, but it does so for data containing explicit ratings [208]. The transition
matrix is introduced in the form

$$ P_{\alpha\beta} = \frac{1 - \delta_{\alpha\beta}}{u_\alpha} \sum_{i=1}^{M} w_i x_{i\alpha} x_{i\beta}, \qquad (98) $$

where $x_{i\alpha} = 1$ if user i has rated object $\alpha$ and $x_{i\alpha} = 0$ otherwise, $w_i$ is the weight of user
i, and $u_\alpha$ is the normalization term which makes the matrix P row-normalized (i.e.,
$\sum_\beta P_{\alpha\beta} = 1$). User weights $w_i$ are
all set identical but could be made heterogeneous, for example, to give more weight to
reliable users or to suppress spammers (these possibilities have not been studied yet). Due to
the term $1 - \delta_{\alpha\beta}$, $P_{\alpha\alpha} = 0$ and the corresponding random walk on the weighted object-object
network is thus non-lazy (there is no possibility of staying at the initial node after one step).
Using the vector with ratings of user i, $h^i_\alpha = r_{i\alpha}$, object scores corresponding to the forward
and backward propagation of $h^i$ are computed as $F^i = P^T h^i$ and $B^i = P h^i$ (forward and
backward propagation correspond to random walk and heat diffusion, respectively; for
details see [195]). The final score of object $\alpha$ is obtained as $f^i_\alpha = F^i_\alpha B^i_\alpha$ (the higher,
the better). Note that this algorithm does not aim at predicting missing scores (hence
the traditional measures of recommender systems, MAE and RMSE, cannot be applied to
evaluate it). Instead, it provides a personalized ranking (hence the name, 'B-Rank') of
objects for each user.
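As a rough illustration of the procedure just described, the following Python sketch builds the row-normalized matrix P with a zero diagonal and multiplies the forward and backward scores. The data and function names are hypothetical, and user weights default to identical values as in the text.

```python
def brank_transition(X, weights=None):
    """Row-normalized object-object matrix P with zero diagonal.

    X[i][a] = 1 if user i has rated object a, else 0.
    """
    n_users, n_objects = len(X), len(X[0])
    w = weights if weights is not None else [1.0] * n_users
    P = []
    for a in range(n_objects):
        row = [
            0.0 if a == b else sum(w[i] * X[i][a] * X[i][b] for i in range(n_users))
            for b in range(n_objects)
        ]
        u = sum(row)  # normalization term u_alpha
        P.append([p / u for p in row] if u > 0 else row)
    return P


def brank_scores(X, h, weights=None):
    """Final scores f_a = F_a * B_a with F = P^T h and B = P h."""
    P = brank_transition(X, weights)
    n = len(h)
    forward = [sum(P[b][a] * h[b] for b in range(n)) for a in range(n)]
    backward = [sum(P[a][b] * h[b] for b in range(n)) for a in range(n)]
    return [f * b for f, b in zip(forward, backward)]


# Hypothetical data: 3 users, 3 objects; the target user rated objects 0 and 1.
X = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
h = [5, 3, 0]  # ratings of the target user (0 means not rated)
P = brank_transition(X)
scores = brank_scores(X, h)
print(scores)
```

Only the ranking of the not-yet-rated objects matters here; the scores themselves are not rating predictions.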

The above-mentioned diffusion processes can be applied in computing the similarity
between users or items, and then integrated into the similarity-based methods. Liu et al.
[199] defined the similarity between users according to ProbS (see footnote 10), and showed that,
within the routine collaborative filtering algorithm, the proposed similarity index improved
the algorithmic performance compared with the Pearson similarity index. Pan et al. [209]
applied the HybridS process to define similarity, which outperforms the cosine similarity
under the framework of collaborative filtering.

10 Similar to Eq. (94), but the spreading process starts from the user side and ends at the user side.



7. Social filtering 

Recommendations made by a recommender system are often less appreciated than those
coming from our friends [210], and social influences may play a more important role than
similarity of past activities [211, 212]. In addition, the accuracy of recommendation can be
improved by analyzing social relationships, such as coauthorships in academic literature
recommendation [213] and friendships and memberships in product recommendation [214].
Many real systems, such as Delicious.com and Facebook.com, allow users to recommend
objects to their friends. Similarly, users can subscribe to articles from selected bloggers on
blogging sites (Twitter.com and others) as well as to news alerts from information
dissemination systems (Elsevier.com and others). In this section, we first present empirical
evidence that demonstrates the presence of social influence on information filtering. Then
we introduce two basic ways in which social filtering is employed in recommendation: by
quantifying and utilizing trust relationships between users, and by using the opinions of "taste
mates" to select the content to be recommended.



7.1. Social Influences on Recommendations 

Social influences, also called the word-of-mouth effect in the literature, are known to
be crucial to many sociometric processes, such as decision making, opinion spreading and
the propagation of innovation and fashion [215, 216, 217, 218]. Scientists have been aware of
the commercial value of social influences for a long time [219], yet large-scale applications
for commercial purposes only emerged when the Internet era began. Besides, the availability
and the great variety of data provide good opportunities to quantitatively understand
social influences [220]. This section focuses on social recommendations, whose effects
can be roughly divided into two classes: one acts on users' prior expectations, leading to an
increase in sales; the other acts on users' posterior evaluations, resulting in enhanced
user loyalty.

Positive effects of social recommendations on prior expectations have already been
demonstrated in a number of real examples. They are found in a wide range of systems,
including product reviews [221, 222], e-mails [223], blogs [224] and microblogs [225]. In
[226], the authors studied the effects of social influences on purchase preference: users of an
e-commerce system were given the option to recommend an item to their friends through
e-mails after purchase. The first person to purchase the same item through a referral link
from such an e-mail got a 10% discount, and when this happened, the recommender received a
10% credit. As shown in Fig. 13, the purchase probability for a DVD grows remarkably
with the number of recommendations received from friends for this DVD. There
is a saturation at about 10 recommendations, after which the purchase probability does not
increase any more. In other examples, social influences can be much more complicated. A
similar experiment with book sales was reported in which, contrary to common
sense, recommendations had little or even negative effect on the purchase probability. Social
influences may also vary across topics and items: [227] showed that different tags and
topics spread on Twitter differently and [228] found that the strength and direction of social
influence are topic-dependent. A recent work shows that the mutual interaction between
opinions of past viewers and potential future viewers leads to complex dynamics in
qualitative agreement with movie popularity behavior seen in real systems.

Figure 13: Probability of buying a DVD given a number of incoming recommendations (taken from [226]).



In comparison, the question of how social recommendations affect users' posterior
evaluations has received less attention. Ref. [230] empirically analyzed two web sites, Douban.com and
Goodreads.com, where millions of users rate books, music and movies, and share their
ratings and reviews with friends and followers. On these social network sites, the
phenomenon that users recommend favorites to friends and followers plays an important role
in shaping users' behaviors and collections. Fig. 14 compares the probability distribution
of ratings on items in Douban with and without recommendations (the result is very
similar in Goodreads). This demonstrates that an individual is more likely to give a high
rating to an item with word-of-mouth recommendations than to an item without
recommendations.

There is also indirect evidence of positive social influences. For example, on
Twitter, a tweet statistically spreads to about $10^3$ users if it gets retweeted [231], and on
Taobao (see footnote 11), the communication between buyers is a fundamental driving force for purchasing
activity [232].

Many factors could result in positive social influences in online recommendation.
Firstly, the word-of-mouth influences and role-model effects from social mates are very
strong in offline society; for example, an experiment in Nepal [233] shows that on average
the probability of a woman to use a novel menstrual cup Take-Up will increase by 18.6% if
one more of her friends has used Take-Up. Secondly, the friendship network and the
interest-based network are strongly correlated with each other [234], and friends tend to visit the
same items and give them similar ratings.

11 Taobao.com is a Chinese consumer marketplace that is the world's largest e-commerce website.

Figure 14: The probability distributions of ratings, posted with (solid line) or without (dashed line) a
word-of-mouth recommendation in Douban (taken from [230]).

7.2. Trust-Aware Recommender Algorithms

It is the basic paradigm of recommender systems that when computing recommendations
for an individual user, evaluations of the others are not weighted equally and preference
is given to those who have rating patterns similar to those of the given user. This approach neglects
an important facet of the evaluation process: it is not only personal tastes but also social
relationships and the quality of evaluations that differ from one user to another. To make
better use of social relationships, various recommendation algorithms relying explicitly
on trust or user reputation [236, 237] have been developed and applied by many commercial
web sites such as eBay.com [238]. Trust can be used instead of user similarity [239],
in combination with collaborative filtering to help deal with data sparsity and the cold
start problem [240, 241, 242], or it can help to further filter recommendations by
prioritizing those approved by trusted sources (see [243] for a review). The use of reputation
in recommendation is further supported by the evidence that trust correlates with user
similarity [244], meaning that by introducing trust we are unlikely to conflict with users'
interests and preferences. As noted in [245], even an imperfect reputation system may be
beneficial as it (i) provides an incentive for good behavior, (ii) imposes costs on
participants to get established, and (iii) swiftly reacts to bad behavior. The use of trust and
reputation also has its drawbacks, which include: (i) time-consuming computation, (ii) low
incentives for users to provide the required feedback, (iii) privacy concerns about data on trust
relationships, and (iv) low availability of trust datasets for tests of algorithms. However,
without algorithms for trust and reputation, online transactions would be dramatically
affected, if not halted.

The difference between reputation and trust is that while the former characterizes
general beliefs about a user's trustworthiness, the latter concept is personal and relates
to a user's willingness to rely on the actions of a given user. The terms local and global
trust metric are sometimes used instead of trust and reputation, respectively. To establish
trust or reputation, one may let the involved participants rate each other and use these
evaluations to derive trust or reputation scores [237]. When mutual user evaluations are
given, an initial seed of trusted users can be used to find all the trustworthy users as
in the classical Advogato trust metric (see http://www.advogato.org/trust-metric.html).
Other popular trust metrics, such as PageRank [152] and Eigentrust [246], use the
spreading activation approach by which nodes are initially loaded and then propagate their
load within the network [247] (diffusion-based recommendation methods presented in
Section 6 also use this approach). When trust relations between the users are weighted
(with trust values 1 and 0 representing total trust and distrust, respectively), the plain
trust propagation can be shown to be insufficient; the Appleseed algorithm solves this
problem by creating virtual trust edges for backward propagation [248].

It is a great disadvantage of reputation-aware recommender systems that they usually
require substantial input on the user side to evaluate trust and reputation. Some
trust-aware reputation algorithms hence try not to rely on explicit evaluations of other
users. In [249], the authors propose to detect noisy ratings by comparing the actual ratings with
the predicted ratings obtained by a recommendation algorithm, with data only from a
set of implicitly trusted users. This "reputation of ratings" can in turn be used to build
the reputation of users [243]. A different approach was proposed by [250], where the authors
use the information contained in social relationships between the users (which may stem
from users' family or friendship relations), the transitivity of trust (as in [251]), and the
propagation of users' queries in social networks. When computing recommendations for
a specific user i, the greatest weight is hence given to the users who can be connected
with user i in the social network by a short path with high trust values along its edges.
The proposed system is shown to assign trust values correctly and to self-organize to a
state producing highly accurate recommendations (when compared to a simple benchmark
strategy in which one of the recommendations from peers is chosen at random).

7.3. Adaptive Social Recommendation Models 

While the above-described trust-aware systems make use of existing social relationships,
adaptive social recommendation models build a network of users based on their
evaluations. In [252], the authors proposed a model where the recommended items spread over
the network similarly to how an epidemic [62, 253] or a rumor [254] spreads in a society.
Simultaneously with this spreading, the network of users evolves and adapts to best
capture users' similarities. This epidemic-like spreading of a successful item is of particular
importance when individual items swiftly lose their relevance [255] (as is
the case for news stories, for example) because it combines personalization with the speed
of access. It is very different from currently popular services such as digg.com and
reddit.com, which still rely on centralized distribution of items where only those of very
general interest can become popular and be accessed by many. For a recent review of
approaches to news recommendation see [256].

Figure 15: Illustration of the news propagation in an adaptive network model [252]. User i added a
news item, which is automatically considered as approved (A) and sent to users $j_1, j_2, j_3$ who are i's followers.
While user $j_2$ dislikes (D) the news, users $j_1$ and $j_3$ approve it and pass it further to their followers $k_1, \dots, k_5$
who have not evaluated the news yet (which is denoted with question marks). User $k_4$ receives the news
from the authorities $j_1$ and $j_3$, yielding the recommendation score $s_{j_1 k_4} + s_{j_3 k_4}$. At the same time,
user $k_5$ receives the news only from the authority $j_3$ and hence for this user, the recommendation score is
only $s_{j_3 k_5}$.

Here we briefly describe the model introduced in [252]. In this model, users either
"approve" or "disapprove" the consumed items. Each user i has S sources (i.e., S other
users from whom i receives news) and thus the system can be described by a directed
network with a constrained node in-degree S. When a news item is approved by user i, it is
added to the recommendation lists of all i's followers (i.e., all users who have i as one of
their sources). This spreading process is illustrated in Fig. 15. Similarity between users i
and j is defined according to the agreement of their past evaluations

$$ s_{ij} = \frac{n_A}{n_A + n_D} \left( 1 - \frac{1}{\sqrt{n_A + n_D}} \right), $$

where $n_A$ and $n_D$ denote the number of news items on which the evaluations of i and j agree and
disagree, respectively. The term $1/\sqrt{n_A + n_D}$ aims at penalizing user pairs with
little overlap of evaluated news, as they may appear to be a great match for each other simply
because they agree in their evaluations of a few news items. To summarize, items at the top of a user's
recommendation list are probably recommended by multiple sources of this user or, at
least, by a source whose similarity with this user is high.
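The similarity measure and the resulting recommendation score can be sketched as follows. The helper names and the source identifiers are hypothetical; returning zero for a pair with no commonly evaluated news is our assumption, and the score is taken as the sum of similarities over approving sources, as in the example of Fig. 15.

```python
from math import sqrt


def user_similarity(n_agree, n_disagree):
    """Agreement ratio damped by the overlap penalty 1/sqrt(n_agree + n_disagree)."""
    n = n_agree + n_disagree
    if n == 0:
        return 0.0  # assumption: no overlap means no measurable similarity
    return (n_agree / n) * (1.0 - 1.0 / sqrt(n))


def recommendation_score(similarity_to_sources, approving_sources):
    """Sum of the user's similarities with those sources that approved the item."""
    return sum(similarity_to_sources[j] for j in approving_sources)


# A single agreement is fully penalized; the same agreement ratio on more
# commonly evaluated news counts for more.
few = user_similarity(1, 0)
some = user_similarity(4, 0)
many = user_similarity(75, 25)

# Hypothetical follower with three sources, two of which approved the item.
sims = {"j1": 0.4, "j2": 0.1, "j3": 0.3}
score = recommendation_score(sims, ["j1", "j3"])
print(few, some, many, score)
```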

Apart from using user similarity in the recommendation process (the recommendation
score of item $\alpha$ for user i is given by the summed similarity between i and those sources of i who
approved this item), it is also crucial for updating the source-follower network. This
updating aims at maximizing the similarity of each user with their sources. As shown in
[252] by agent-based simulations, the system can evolve from a random initial state to a
highly organized state where taste mates are connected and news spreads effectively. The
original network can be improved by rewiring, through ineffective random replacements
or computationally demanding global optimization. One can also combine simple greedy
optimization with random assignment [257], repeated trials [258] or exploration of the
directed user network both in the direction of followers and sources [259]. Robustness of
this recommendation approach can be enhanced by introducing user reputation [257].
This adaptive evolution of the network of user-user interactions can be used to explain
the widely observed scale-free leadership structure in social recommender systems [260],
and recent analysis suggested that users could get better information by selecting proper
leaders in social sharing websites [261].

8. Meta approaches

Input data for recommendation can be extended far beyond the traditional
user-item-rating scheme. In this section, we briefly review the so-called meta approaches where
additional information of various kinds (tags assigned to items by users or time stamps of
past evaluations) enters the recommendation process, or where several recommendation
methods are simply combined (either in an iterative self-evaluating way or by forming a
hybrid algorithm). As possibilities for extensions are relatively easy to find, this direction
has seen high activity over the past years.

8.1. Tag-aware methods 

In the past decade, the advent of Web 2.0 and its affiliated applications brought us
a new paradigm that facilitates user-generated content on the Internet. Examples
are user-driven platforms that allow users to store resources (bookmarks, images,
documents, and others) and to associate them with personalized words, so-called tags.
The resulting ternary user-object-tag structure, the so-called folksonomy, represents a rich
data structure that is of interest for computer scientists, sociologists, linguists, and also
physicists. When viewed from the perspective of an individual user, these tags
constitute a personalized folksonomy [262] where tags, although only simple words, contain
highly abstracted yet personalized information. Different from other kinds of metadata
(such as profile, attributes, and content), tags are not predefined by domain experts or
administrators. This approach has the advantage of being scalable and requiring no
specific skills, hence allowing every individual to participate and contribute. Despite the lack
of imposed organization, shared vocabularies were shown to emerge in folksonomies [263],
making them increasingly accessible to advanced information filtering techniques. Tags
therefore represent a simple yet promising tool to provide reasonable recommendations and
solve some outstanding problems in recommender systems, e.g., the cold-start problem [264].
The social impact [265] and dynamical properties of folksonomies [266, 267] are expected
to be applied to obtain trustworthy and real-time recommendations in tagging systems.
In addition, hypergraph theory [92] is considered as a way to fully utilize the complete network
structure of tagging platforms without using any hybrid methods or losing any
information, which promises generally more reliable recommendations. At the same
time, there are also drawbacks caused by the freedom of tags, e.g., polysemy, synonymy, and
ambiguity [268]. To alleviate these problems, advanced methods such as hierarchical tag
clustering [269], the introduction of ontologies [270] and the recommendation of tags [271] were
proposed.

Rather than using tags as a traditional filter, researchers tend to apply more sophisticated
theories and methods (e.g., social impact) in designing tag-aware recommendation
algorithms. FolkRank [272], a modified PageRank algorithm, was proposed to rank tags
in folksonomies by assuming that important tags are given by important users. Due to
the success of collaborative filtering, many works were devoted to using tags to measure
similarity among users or objects, and then fusing it with the standard memory-based
collaborative filtering framework [273, 274, 275]. In [82], the present tag-aware algorithms are
classified as follows:

1. Topic-based models. They implement probability-based methods such as pLSA and
LDA (see Sections 5.3 and 5.4) to extract latent topics from the available tags in the
user or object space and then produce recommendations using classical
probability-based models [276, 277, 278].
2. Network-based models. They implement graph theory-based methods such as ProbS
(see Sec. 6.3) to represent tags as nodes in a tripartite user-object-tag network and
apply a diffusion process to generate recommendations [279] (see Fig. 16).
3. Tensor-based models. They implement tensor factorization [280] to reduce the ternary
relation into low-rank feature matrices, alleviate the sparsity problem in large-scale
datasets and ultimately provide personalized recommendations [281, 282].

Figure 16: (Color online) Illustration of a user-item-tag tripartite graph consisting of three users, five items
and four tags, as well as the recommendation process described in [279]. The tripartite graph is decomposed
into user-item (black links) and item-tag (red links) bipartite graphs connected by items. The
scoring process for a given target user $U_1$ works as follows. (a) Firstly, highlight the items $I_1$, $I_3$, $I_5$ collected by
the target user $U_1$ and mark them with unit resource, so that in the depicted case $f_{I_1} = f_{I_3} = f_{I_5} = 1$
and $f_{I_2} = f_{I_4} = 0$. (b) Secondly, distribute the resources from items to their corresponding users and tags,
respectively. (c) Finally, redistribute the resources from users and tags to their neighboring items.
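The network-based approach can be sketched as two diffusions that start from the items collected by the target user: one through users and one through tags. The sketch below is illustrative only. The link matrices are hypothetical toy data (not the graph of Fig. 16), and combining the two results with a linear weight `mix` is our assumption rather than something stated in the text.

```python
def two_step_diffusion(links, f):
    """Spread item resources f to neighboring nodes (users or tags) and back.

    links[x][a] = 1 if node x (a user or a tag) is connected to item a.
    """
    n_nodes, n_items = len(links), len(f)
    k_item = [sum(links[x][a] for x in range(n_nodes)) for a in range(n_items)]
    k_node = [sum(row) for row in links]
    # Items -> users/tags: each item splits its resource equally among its neighbors.
    received = [
        sum(links[x][a] * f[a] / k_item[a] for a in range(n_items) if k_item[a])
        for x in range(n_nodes)
    ]
    # Users/tags -> items: each node splits what it received equally among its items.
    return [
        sum(links[x][a] * received[x] / k_node[x] for x in range(n_nodes) if k_node[x])
        for a in range(n_items)
    ]


def tag_aware_scores(user_item, tag_item, target, mix=0.5):
    """Blend user-based and tag-based diffusion; `mix` is a hypothetical weight."""
    f = user_item[target]  # unit resource on the items the target user collected
    via_users = two_step_diffusion(user_item, f)
    via_tags = two_step_diffusion(tag_item, f)
    return [mix * u + (1.0 - mix) * t for u, t in zip(via_users, via_tags)]


# Hypothetical toy data: 3 users x 5 items and 4 tags x 5 items.
user_item = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
tag_item = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
scores = tag_aware_scores(user_item, tag_item, target=0)
print(scores)
```

Because every step only redistributes resource, the total amount (here three units) is conserved by each diffusion and therefore by the blend.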

8.2. Time-aware methods

Nowadays, huge quantities of information emerge every second. We receive news from
various media, such as newspapers, TV programs, websites, etc. Due to its timeliness and
in particular its convenience, more and more people prefer to read news online (e.g., using
RSS feeds) instead of from traditional media like newspapers. However, given the enormous
amount of online news, one challenging issue is the novelty of news recommendation, i.e.,
how to appropriately and efficiently recommend news to readers, matching their reading
preferences as much as possible. The analysis of data from a popular platform for sharing
news stories, Digg.com, by Wu and Huberman [283] shows that the novelty of news
there decays in a very short period. Another typical instance is the communication system
of e-commerce websites, which requires real-time feedback among various agents [284]. Such
information is updated so frequently that it is impossible for individuals to evaluate every item, or
even to read them in time. Consequently, an urgent question emerges: how to automatically
filter out the irrelevant information, receive timely news and respond immediately yet
appropriately? One promising solution lies in time-aware recommender systems, which
can hopefully help to address the aforementioned issues. Collaborative filtering (see Sec. [1]), as
the most widely adopted method in recommender systems, is the first one to be considered.
Following the classical collaborative filtering framework, most of the related work focuses
on designing time factors to suppress old evaluations or objects. Generally, such
weights are expressed by various decay functions (see Fig. 17), reflecting the observation that user
interest in a single topic decays with time (see Fig. 18). Ding and Li [285] weighted
different items by putting smaller weights on older ones. Similar methods used the time
factor to adaptively choose temporal neighborhoods and then obtain recommendations via
the refined neighbors [286, 287, 288]. Another line of work uses time decay
to weight user-item binary edges in bipartite networks. Liu and Deng [289] hypothesized
that the time effect decays exponentially, which was also found in other
empirical studies [283] and models [290]. A broad picture of collaborative filtering with
time can be found in a recent Ph.D. thesis [291].

Figure 17: (Color online) Illustration of time-based recommendation where the target user has collected
three items, $I_1$, $I_2$ and $I_3$, which are then used to predict the newest item $I_4$. $f$ is the decay function used to
weight the time difference between $I_4$ and other items that were collected before.
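A decay function of this kind can be illustrated with a short Python sketch. The exponential form follows the hypothesis mentioned above; the time scale `tau`, the similarity values and the function names are hypothetical choices made only for this example.

```python
from math import exp


def decay_weight(age, tau):
    """Exponential decay: an evaluation of the given age counts exp(-age / tau)."""
    return exp(-age / tau)


def time_weighted_score(similarities, collected_at, now, tau):
    """Score of a candidate item from its similarity to previously collected items.

    similarities[n] is the similarity between the candidate and the n-th collected
    item, and collected_at[n] is the time at which that item was collected.
    """
    return sum(
        sim * decay_weight(now - t, tau)
        for sim, t in zip(similarities, collected_at)
    )


# Two equally similar items, one collected 10 time units ago and one just now.
score = time_weighted_score([0.5, 0.5], [0.0, 10.0], now=10.0, tau=10.0)
print(score)
```

The older item contributes less to the score than the recent one even though both are equally similar to the candidate, which is the effect the decay functions are designed to achieve.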

Another important issue in recommender systems is that of novelty. Although an
item with the highest recommendation score is the most likely candidate for a target
user, it may fail to be picked due to the diversity of human tastes. In such cases, these
items should not always occupy the recommendation list and be recommended over and
over again. Therefore, temporal diversity [292] becomes crucial in designing time-based
algorithms. Xiang et al. divided user interests into two general categories: long-term
and short-term [32]. Long-term interests govern the essential preferences of users and
do not easily change over time. By contrast, short-term interests are more likely to be
affected by the social environment (Fig. 18 shows this difference). By identifying and making
use of the differences between them using a time factor, the authors of [32] successfully provided
more reliable yet interesting recommendations; similar findings are reported in related works.

Figure 18: (Color online) Illustration of users' interests changing with time in the MovieLens data. Shifts
of user interests are represented by the average similarity among item pairs of the target user within an
observed time t (shown with gray circles). The dashed line represents the average similarity of all users.

Besides recommendation, time also plays an important role in various fields, such as
network growth [294, 255], identification of original scientific contributions [295, 296],
selecting the backbone structure of citation networks [297], and aging effects in
synchronization [298]. However, how to appropriately use the time information to discover the
underlying dynamics of user preferences and help us master the current information era
still remains an important research challenge.

8.3. Iterative refinement 

Iteratively solved self-consistent equations are widely applied in characterizing structural
and functional features of nodes and/or edges in networked systems. Their solution
is usually used to describe a stable distribution of a certain quantity in a system consisting
of many interacting individuals, where the amount of this quantity assigned to individual i
is affected by both the interaction rules and the amounts of this quantity assigned to other
individuals interacting with i. In a directed network, the significance of a node is not only
determined by its attributes (if applicable), but also depends on the significance of its
downstream nodes. For example, the classical set of self-consistent equations for the PageRank
value G(i) of the web page i has the form [152]

$$ G(i) = c + (1 - c) \sum_{j} \frac{G(j)}{k_j^{out}}, \qquad (100) $$

where j runs over all the web pages that point to i, c is the return probability accounting
for random browsing, and $k_j^{out}$ is j's out-degree. This set of equations represents
a particular Markovian process on a network and can be successfully solved using an
iterative approach thanks to the fast convergence to a stationary solution observed in
most cases [299]. Besides web pages, similar self-consistent equations are also successfully
applied in ranking people [300], genes [301], and so on. If instead of referring to nodes, the
individuals refer to pairs of nodes, this approach can be used to quantify node similarity
[142, 302]. In more complicated situations, the individuals can be users and items, or
scientists and publications, and similar iterative equations embodied in bipartite networks
can be applied in building quantitative reputation systems, namely to simultaneously estimate
people's reputation and objects' quality [303, 304, 305, 306, 307, 308].
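The iterative solution of the PageRank equations can be sketched as follows. The sketch assumes that every page has at least one outgoing link and that all link targets appear as keys of the dictionary; the value c = 0.15, the tolerance and the two toy graphs are hypothetical choices.

```python
def pagerank(out_links, c=0.15, tol=1e-10, max_iter=1000):
    """Iterate G(i) = c + (1 - c) * sum_j G(j) / k_j_out until the values settle.

    out_links maps each page to the list of pages it points to.
    """
    G = {i: 1.0 for i in out_links}
    for _ in range(max_iter):
        new = {i: c for i in out_links}
        for j, targets in out_links.items():
            share = (1.0 - c) * G[j] / len(targets)
            for i in targets:
                new[i] += share
        if max(abs(new[i] - G[i]) for i in out_links) < tol:
            return new
        G = new
    return G


# In a directed cycle all pages are equivalent, so all values stay at 1.
cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
G_cycle = pagerank(cycle)

# A small asymmetric example: page c is pointed to by both a and b.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
G_web = pagerank(web)
print(G_cycle, G_web)
```

With this normalization the values sum to the number of pages, and each update shrinks the remaining error by a factor of at most 1 - c, which is the fast convergence referred to in the text.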

Besides the above-mentioned iterative equations, a closely related framework is that of 
the so-called self- consistent refinement [309| 1310] . In the link prediction and personalized 
recommendation, the known information is the adjacency matrix representing a unipartite 
or a bipartite network and the task is to estimate the likelihoods of link existence for 
currently zero elements of the adjacency matrix. For recommender systems with ratings, 
the algorithms need to predict unknown ratings according to the rating matrix. Denoting 
R the known matrix and R the predicted matrix {i.e., output), the procedure of many 
algorithms can be written in a generic form 

R = D(R), (101) 

where D is a matrix operator]^] Denoting the initial configuration as R(°) and the initial 
time step k = 0, a generic framework of self-consistent refinement reads |310j : (i) Implement 
the operation S)(R^^); (ii) Set the elements of R( fc+1 ) as 

(Jfe+ i) = f D(RW) JQ when R?J = 0, 

\ Ria when RfJ ^ 0. 1 } 

Then, set k — k + 1. (iii) Repeat (i) and (ii) until the difference between and R( fc_1 ) 
is smaller than a given terminating threshold. 

Considering the matrix series R(°), R^, • • • , R^ (T denotes the last time step) as a 
certain dynamics driven by the operator D, all the elements corresponding to the known 
items (i.e., ^ 0) can be treated as the boundary conditions expressing to the known 
and fixed information]^! If R is an ideal prediction, it should satisfy the self-consistent 



12 See [310] on how to use this generic form to represent the well-known similarity-based and 
spectrum-based algorithms for rating prediction. 

13 This is the essential difference between the self-consistent refinement and the above-mentioned iterative 
equations like PageRank, since in the latter case, every matrix element is free to be changed. 
Figure 19: Prediction error (MAE) versus iterative step for MovieLens data. We use the MovieLens data 
that consists of $N = 3020$ users, $M = 1809$ movies, and $2.24 \times 10^5$ discrete ratings from 1 to 5. All the 
ratings are sorted according to their time stamps, with the 90% earlier ratings as the training set and the 
remaining ratings (with later time stamps) as the testing set. 



condition $\tilde{R} = \mathcal{D}(\tilde{R})$. However, this equation does not hold for most known algorithms. In 
contrast, the convergent matrix $R^{(T)}$ is self-consistent. 

As shown in [310], applying the self-consistent framework can lead to great improvements 
compared with non-iterative methods employing Eq. (101). We next show a simple 
example for similarity-based recommendation. Taking into account the different evaluation 
scales of different users [311], we subtract the corresponding user average from each 
evaluated entry in the rating matrix $R$ and get a new matrix $R'$. The predicted rating is 
calculated by using a weighted average, as: 

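$$\tilde{R}'_{i\alpha} = \frac{\sum_{\beta} s_{\alpha\beta} R'_{i\beta}}{\sum_{\beta} |s_{\alpha\beta}|}, \tag{103}$$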



where the similarity between items $\alpha$ and $\beta$ is defined by Eq. (34). As shown in Fig. 19, 
an experiment on MovieLens confirms the advantages of the iterative refinement, where the 
original similarity-based algorithm corresponds to the first iteration step. 

8.4. Hybrid algorithms 

Even for a good recommender algorithm, it is difficult to address the diverse needs of its 
heterogeneous users. Hybrid methods overcome this problem by aptly combining recommendations 
obtained by different methods [312, 313]. Hybridization is hence often used 
in practical implementations of recommender systems, even in very early ones [162]. One 
of the most important applications of hybrid recommendation algorithms is to solve the 
cold-start problem, by combining collaborative and content data in such a way that even 
a new object that has never been rated before can be recommended [314] (similarly, a new 
user who has never rated anything can receive some recommendations). Since hybrid algorithms 
combine different approaches to recommendation, they also have the potential to 
improve the diversity of recommendations [37]. The following classification of hybridization 
methods is taken from [15]: 



1. implement collaborative and content-based methods separately and combine their 
predictions, 

2. incorporate some content-based characteristics into a collaborative approach, 

3. incorporate some collaborative characteristics into a content-based approach, and 

4. construct a general unifying model that incorporates both content-based and collab- 
orative characteristics. 

We can now extend class 4 to also include unifying models that incorporate two or more 
collaborative methods (see Sec. 6.4). 



Hybrid methods range from simple, such as using a linear combination of ratings obtained 
by different methods [160], to very complex, such as employing Markov chain Monte 
Carlo methods [315] to model combined collaborative and content data. The recent Netflix 
Prize (see Sec. 2.1) has provoked interest in sophisticated methods for combined predictions 
(also called ensemble learning or blending), which were shown to be very successful 
in lowering the error of predictions [316]. The main idea of blending is that the prediction 
vectors of $F$ distinct recommendation methods, denoted as $x_1, \dots, x_F$, are combined by a 
function $Q: \mathbb{R}^F \to \mathbb{R}$ so that the prediction error on a test set (evaluated by RMSE, for 
example) is minimized. The optimal weighting function $Q$ is obtained by linear regression, 
neural networks, or bagging predictors [317, 318]. For details and evaluation of different 
blending schemes on the extensive dataset provided by Netflix for the competition, see [316]. 
Some challenges can also be partially solved by hybridization; for example, a link prediction 
algorithm can be used to generate artificial links that could eventually improve the 
recommendation in very sparse data [123, 319]. 
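As a minimal sketch of the simplest blending option mentioned above (linear regression), the following Python code fits least-squares weights that combine the prediction vectors of several methods. The function names and the use of an intercept term are assumptions made for illustration; the blending schemes evaluated in [316] are considerably more elaborate. In practice the weights would be fitted on held-out ratings rather than on the data used to train the individual methods.

```python
import numpy as np


def fit_linear_blend(predictions, targets):
    """Least-squares weights (intercept first) for combining F prediction vectors.

    predictions: array of shape (n_ratings, F), one column per method.
    targets:     array of shape (n_ratings,), the true ratings.
    """
    X = np.column_stack([np.ones(len(targets)), predictions])
    weights, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return weights


def blend(predictions, weights):
    """Apply the fitted linear blending function to new prediction vectors."""
    X = np.column_stack([np.ones(predictions.shape[0]), predictions])
    return X @ weights


def rmse(predicted, true):
    """Root mean square error between two rating vectors."""
    return float(np.sqrt(np.mean((predicted - true) ** 2)))
```

On the ratings used for fitting, the blended RMSE cannot exceed that of any single method, because each individual method is itself one admissible linear combination.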
data set     users    objects    ratings      density    k_U    k_O
Movielens    6,040      3,706    1,000,209    4.5%       166    270
Netflix      8,000     17,148    1,632,172    1.2%       204     95

Table 5: Data sets used for evaluation of recommendation methods. Data density is defined as the ratio 
of the available ratings to the maximum possible number of ratings; $k_U$ and $k_O$ denote the average user 
and object degree, respectively. 



9. Performance evaluation 

In this section we briefly compare the performance of individual algorithms presented in 
this review. To test the algorithms, we use two standard data sets: Movielens 1M (footnote 14) and 
Netflix (footnote 15). The Movielens data set, which contains ratings from 6,040 users to 3,706 movies 
(corresponding to the rating density of $4.5 \cdot 10^{-2}$), is used in its original form. Our subset 
of the original Netflix data set was created by randomly selecting 8,000 users from the 
original data set released by Netflix for the Netflix Prize and keeping all their evaluations 
(see Sec. 2.1 for details on the competition). In this way, a data set with 17,148 objects (DVDs 
rented by the company to its users) and 1,632,172 ratings (corresponding to the rating 
density of $1.2 \cdot 10^{-2}$) was created. Both data sets use 
the integer rating scale from one to five. For methods requiring no explicit ratings (binary 
data), we assume that all ratings greater than or equal to three represent objects liked by 
the users and hence constitute a corresponding user-object link (if the rating is less than 
three, no link is formed). As a result, there are 1,387,039 and 836,478 links in the Netflix 
and Movielens data set, respectively. Table 5 summarizes basic properties of the data sets. 
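The quantities reported for the two data sets follow directly from the counts of users, objects and ratings. The short Python sketch below reproduces that arithmetic and the rating threshold used to build binary data; the function names and the `(user, object, rating)` tuple layout are assumptions chosen for illustration.

```python
def density(n_ratings, n_users, n_objects):
    """Ratio of available ratings to the maximum possible number of ratings."""
    return n_ratings / (n_users * n_objects)


def average_degrees(n_ratings, n_users, n_objects):
    """Average user degree and average object degree."""
    return n_ratings / n_users, n_ratings / n_objects


def binarize(ratings, threshold=3):
    """Keep a user-object link only for ratings greater than or equal to threshold."""
    return [(user, obj) for (user, obj, rating) in ratings if rating >= threshold]


# Counts quoted in the text for Movielens and for the Netflix subset.
movielens = dict(n_ratings=1_000_209, n_users=6_040, n_objects=3_706)
netflix = dict(n_ratings=1_632_172, n_users=8_000, n_objects=17_148)
```

For example, `density(**movielens)` is roughly 0.045 and `average_degrees(**netflix)` is roughly (204, 95), in line with the values quoted above.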

Our two data sets differ not only in their density and user/object ratio; histograms 
of user and object degrees in Fig. 20 reveal further differences. Firstly, the Movielens data 
were originally prepared in such a way that all users rated at least twenty movies. This 
constraint has not been applied to the Netflix data set, resulting in a considerable number 
of users with only little data on their past preferences available. Secondly, the Netflix data also 
contain a large portion of movies that have been rated only a few times (this probably 
reflects the fact that the company rents a wide variety of DVDs, many of which are of 
interest to only a small part of the customers). Unsurprisingly, all degree distributions 
are broad and right-skewed, similar to those of many other social systems [26]. 

To test a recommendation method, we employ the standard approach. First, a randomly 
selected small part of the input data is moved into a so-called probe. In our case, the probe 
contains 10% of the ratings present in the input data set. The remaining 90% of the data is then 
given to the recommendation method and used to estimate the missing ratings. The 
estimated missing ratings are then compared with the true ratings present in the probe set. 
This comparison is done by means of root mean square error (RMSE) and mean absolute 



14 This data set can be obtained at http://www.grouplens.org/node/73 

15 This data set can be obtained at www.netflixprize.com 

Figure 20: User and object degree distribution for the two data sets used to evaluate recommendation 
methods (axis labels: user degree, object degree). 



                         Movielens         Netflix
method                   RMSE    MAE       RMSE    MAE
overall average          1.12    0.93      1.09    0.92
object average           0.98    0.78      1.02    0.82
user average             1.04    0.83      1.00    0.80
multilevel spreading     0.94    0.75      0.97    0.77
user similarity          0.91    0.72      0.93    0.72
object similarity        0.89    0.70      0.96    0.74
SVD                      0.85    0.67      0.87    0.68
slope one                0.91    0.71      0.93    0.73
slope one weighted       0.90    0.71      0.93    0.73

Table 6: Performance of algorithms for data with explicit ratings (averaged over 10 realizations). 



error (MAE) in the case of data with explicit ratings, and by means of precision, recall and 
the average relative rank in the case of data without explicit ratings (see Section 3.4 for a 
detailed description of these performance metrics). For precision and recall, we take into 
account the top 100 places of each user's recommendation list. To eliminate possible effects of 
the probe selection, we repeat the procedure for ten independent randomly selected probe 
sets and present the averaged results. 
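The evaluation protocol for data with explicit ratings can be sketched as follows. The code is a hedged illustration, not the scripts used to produce the reported numbers: the array layout (one row per rating with columns user, object, rating), the function names and the fixed seed are assumptions, and only the overall average benchmark is implemented as the method under test.

```python
import numpy as np


def split_probe(ratings, probe_fraction=0.1, rng=None):
    """Randomly move a fraction of the ratings into the probe; return (train, probe)."""
    rng = np.random.default_rng() if rng is None else rng
    ratings = np.asarray(ratings, dtype=float)
    order = rng.permutation(len(ratings))
    n_probe = int(round(probe_fraction * len(ratings)))
    return ratings[order[n_probe:]], ratings[order[:n_probe]]


def rmse(predicted, true):
    """Root mean square error."""
    return float(np.sqrt(np.mean((predicted - true) ** 2)))


def mae(predicted, true):
    """Mean absolute error."""
    return float(np.mean(np.abs(predicted - true)))


def evaluate_overall_average(ratings, n_runs=10, seed=0):
    """Average RMSE and MAE of the overall-average benchmark over independent probes."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        train, probe = split_probe(ratings, 0.1, rng)
        # The benchmark predicts the mean training rating for every probe entry.
        estimate = np.full(len(probe), train[:, 2].mean())
        scores.append((rmse(estimate, probe[:, 2]), mae(estimate, probe[:, 2])))
    mean_rmse, mean_mae = np.mean(scores, axis=0)
    return float(mean_rmse), float(mean_mae)
```

Any other rating-prediction method can be plugged in by replacing the line that builds `estimate` with predictions computed from `train` for the user-object pairs in `probe`.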

The method overall average is used only as a benchmark; it uses the average rating in 
the input data as the estimate for every user-object pair. For similarity-based methods, we 
employ the Pearson correlation coefficient, which slightly outperforms cosine similarity in 
terms of RMSE and MAE. For SVD, we used parameter values $D = 20$, $\eta = 0.001$ and 
$\lambda = 0.1$ that result in favorable performance (see the section on SVD for the meaning of these 
parameters). Note that a better founded approach would be to learn "optimal" values of 
these parameters from the data itself. This can be achieved by choosing $\{D, \eta, \lambda\}$ based 
                          Movielens                  Netflix
method                    P_100    R_100    rank     P_100    R_100    rank
global rank               0.039    0.311    0.143    0.041    0.272    0.066
Bayesian clustering       0.028    0.276    0.137    0.045    0.262    0.069
pLSA                      0.071    0.575    0.090    0.071    0.456    0.050
LDA                       0.081    0.543    0.093    0.081    0.439    0.048
ProbS                     0.052    0.422    0.112    0.053    0.359    0.051
HeatS                     0.039    0.336    0.116    0.001    0.020    0.099
ProbS+HeatS, λ = 0.2      0.067    0.548    0.080    0.062    0.421    0.046

Table 7: Performance of algorithms for data without explicit ratings (averaged over 10 realizations). 

on RMSE or MAE computed for a small test set of predictions (which could again be 
created by taking 10% of the input data) and only then reporting the resulting method's 
performance computed for the probe. However, with only three free parameters to tune, 
the results are likely to differ little from the results obtained by our naive approach where 
$\{D, \eta, \lambda\}$ are chosen directly from performance observed for the probe. 

While numerical performance values may seem very close to each other across all the 
methods (perhaps with the exception of overall average), differences between the methods 
from the user's point of view are much greater than one would expect from RMSE varying 
at the second decimal place. For example, user average outperforms object average for the 
Netflix data set, yet in fact it has zero filtering ability: for a given user, the estimated rating 
of all unrated objects is the same (equal to this user's average rating), so this user is 
provided no useful information as to which object to select in the future. Further, the 
performance of object average may seem appealing with respect to the method's simplicity, 
yet one can easily check that objects with the highest estimated score are likely those that 
received only a few ratings. This is because while a rarely viewed object may easily receive 
the highest possible average rating of five, a popular object inevitably receives some worse 
marks, which result in a lower average rating. For example, top-rated movies in both tested 
data sets have all received fewer than five ratings and scored 5.0 on average. This analysis 
shows that RMSE and MAE, while useful and easy to understand, give only very limited 
information about a method's performance. 

Table 7 summarizes the performance of methods requiring data without explicit ratings 
(binary data). As performance metrics we use precision $P$, recall $R$ and the relative rank 
(marked as rank in the table) of probe objects (if probe object $\alpha$ belonging to user $i$ appears 
on place $x$ of this user's recommendation list and this user has collected $k_i$ objects, the 
relative rank of $\alpha$ is $x/(M - k_i)$). The method global rank corresponds to recommendation 
of the most popular items that have not yet been collected by a given user. The results 
of pLSA and LDA are obtained with $K = 50$; a slight increase in performance 
is observed when $K$ increases further. Results of Bayesian clustering are obtained with 
$(K_{\mathrm{user}}, K_{\mathrm{items}}) = (70, 35)$ for Movielens and $(K_{\mathrm{user}}, K_{\mathrm{items}}) = (70, 140)$ for Netflix, where 
$K_{\mathrm{user}}/K_{\mathrm{items}}$ is in a rough proportion to the corresponding ratio of users to items. Note 
that global rank and Bayesian clustering are able to yield low relative rank but they fail to 
score in precision and recall. 
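For a single user, the three metrics can be computed as in the sketch below. The relative rank follows the definition given above, $x/(M - k_i)$; precision and recall are taken here in their common top-$L$ form (hits divided by $L$ and hits divided by the number of the user's probe objects, respectively), which is an assumption about the exact definitions of Section 3.4. The function name and argument layout are likewise illustrative.

```python
def evaluate_user(ranked_objects, probe_objects, n_objects, n_collected, top=100):
    """Precision, recall and mean relative rank for one user's recommendation list.

    ranked_objects: uncollected objects ordered from most to least recommended.
    probe_objects:  this user's objects that were moved to the probe.
    n_objects:      total number of objects M.
    n_collected:    number of objects k_i the user collected in the training data.
    """
    probe = set(probe_objects)
    hits = sum(1 for obj in ranked_objects[:top] if obj in probe)
    precision = hits / top
    recall = hits / len(probe) if probe else 0.0

    # Place x of each object in the list, counted from 1.
    place = {obj: x for x, obj in enumerate(ranked_objects, start=1)}
    ranks = [place[obj] / (n_objects - n_collected) for obj in probe if obj in place]
    mean_rank = sum(ranks) / len(ranks) if ranks else None
    return precision, recall, mean_rank
```

Averaging these per-user values over all users with at least one probe object gives figures of the kind reported in Table 7; a lower relative rank and a higher precision and recall indicate better performance.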
10. Outlook 



After reviewing extensively the past work in the field of recommender systems, we 
describe here a few challenges that the field of recommendation needs to tackle in the 
future. 

To begin with, let us take the conceptual question of the possibility of effective 
recommendation. For a long time, we all thought that we know ourselves better than anyone 
else does. Without much fanfare and fuss about philosophical implications, IT experts and 
online businesses continue in their exploration of the part of our knowledge that resides 
not only in our minds, but at the crossroads of communities. This in effect has violated both 
our self-knowledge belief and an implicit, long cherished notion: individuals are masters of 
themselves. Conceptually, this admission is nothing short of a revolution. The potential 
of the new approach is huge: the "extra-body" knowledge about our preferences manifests 
itself in communication data and is thus much easier to analyze and decipher than the part 
hiding among our neurons and synapses. What Amazon or Netflix does is just scratching 
the surface of this huge potential, as they only take into account a tiny fraction of 
the information about us. Google's Gmail gains more insight about what its users do in the 
hope that emails reveal more than what can be obtained from data about searching, 
book-buying, and movie-rating, thereby matching ads more closely to our preferences and hence 
increasing their efficiency. Some recent online communities go even a step further than 
Gmail. For example, Facebook.com lets its members create trusted relationships and 
keeps track of members' activities and conversations, obtaining the opportunity to infer 
intimate details about users' preferences. Though most of this information is implicit and 
not yet ready for recommendation, the huge data base can in principle yield much more 
insight than hitherto seen. By letting users reveal more and more, the potential 
for inferring their future wants grows, and we have yet to learn what the consequences will 
be. The induced danger of privacy violation calls for new, privacy-preserving recommender 
systems where no sensitive data leave users' computers, yet users can enjoy the benefit of 
collaborative filtering. 

IMDB (Internet Movie Database) and similar web sites aggregate millions of votes for 
a wide range of movies, and online sellers of movies use some degree of collaborative fil- 
tering to make recommendations given one's past purchasing history. However, an open 
reputation-sharing mechanism has yet to become widespread. One can project forward 
to imagine innovative applications, such as "Movies Wanted": a system where plot de- 
scriptions are collaboratively developed and voted on, to highlight movies desired by a 
constituency. The net effect of reputation filtering will be to bring more old, foreign, and 
niche movies to light, with similar effects for music and other culture. Cultural oppor- 
tunities that languish for want of attention due to high search costs will reach audiences 
that did not know what they were missing. Many recommender systems provide sugges- 
tions based on expressed or observed preferences. But reputations could also encode other 
properties of media, such as "ethicalness" of lyrics (and indeed of the performers' lives and 
aims if one desires), or specific legal or reproduction rights. Licensing schemes like Creative 
Commons certify an artistic work as having particular legal properties; it is then feasible 
to provide both recommendations and direct access just within the set of freely available 
music. 

Beyond music and movies, numerous cultural areas and experience goods are ripe for 
recommendation services provided by reputations. Book ratings and suggestions provide 
a navigation tool through humanity's ever-growing literary output — most notably from 
Amazon, but also from a variety of small-scale services and personal lists. Travel guide- 
books aid in getting the insider view of an unfamiliar locale, but interpreted experiences 
of natives and previous travelers could be even better. Whether for festivals, museums, 
opera, or the thousands of other shared activities which enrich our social landscape, the 
cultural sector is fertile ground for development. 

In the age of the Internet and the World Wide Web, ICT tools allow information to spread 
faster and wider than ever before, and dominate the way we form our opinions and 
knowledge. While there has been undeniable progress in information availability, a 
fundamental question remains elusive: do we get more diverse information than in the past? 
Although the ICT revolution was expected to allow people to access ever more diversified 
information sources and products, one often sees that this is not the case. Popular viral 
videos copy the strategy of blockbuster films and target the tastes of the general audience, 
giving rise to global hits. On the Internet, a few sites attract a huge part of web traffic. A 
similar process is in action in science, where disproportionate attention is given to a small 
fraction of all new and exciting works [295]. The problem is that search engines and 
recommender systems fall prey to a self-reinforcing rich-get-richer phenomenon: items that 
were popular in the past tend to be served to even more users in the future. The natural 
outcomes of such defective dynamics are the narrowing of people's tastes and opinions 
together with a general cultural flattening. To address this issue, we need to consider the 
long-term impacts of information filtering systems on the information ecology and study 
information filtering tools that favor diversity without sacrificing their overall performance. 

Another interesting facet of the mentioned diversity challenge is related to a concept 
of "crowd-avoidance". There are situations where the generated data naturally fits the 
paradigm of recommender systems, yet using a standard recommender system may result 
in poor outcomes. For example, when given data of user preferences for restaurants, it 
is natural to recommend a user a new place to eat. However, if too many users are 
recommended the same place, it gets crowded and nobody enjoys their meal. Similarly, 
when given data of industrial sectors active in various countries (which can be effectively 
represented by a bipartite network [320], so much discussed and utilized in this review), one 
may recommend a country a new sector on the basis of its similarity with already active 
sectors. However, if the country faces a strong competition from its neighbors, it may do 
better by choosing a less similar sector where the competition is weaker. The same happens 
on a smaller scale where companies routinely compete with products of other companies, 
yet avoiding a direct clash may be very beneficial. The concept of crowd avoidance in 
recommender systems could yield benefits in situations similar to those described above, 
where resources cannot be shared by an arbitrary number of parties due to constraints of 
geographic space or limited interest of customers. 
The crowd avoidance phenomenon ranges from practically non-existent (in the case of 
e-book distribution, for example) to very strong (no two customers can share an item), 
where it approaches the classical assignment problem [321]. The most challenging seems 
to be the moderate case where recommending an item to a small number of consumers 
does not yet create a problem (which fits well the above-described product and restaurant 
recommendation, for example). Note that this whole problem is in principle similar to 
quantum physics systems where occupation numbers are confined by constraints (such as 
the Pauli exclusion principle, which says that two identical fermions cannot simultaneously 
occupy the same quantum state) or where mutual repulsion among particles sharing the 
same site exists. Analogies with physics can thus prove useful in studying this kind of 
system. 

The science of recommendation is just starting: despite impressive progress, much 
remains to be understood. For further advances, intuition alone is no longer enough, and 
a multidisciplinary approach will surely bring powerful tools that may help innovative 
matchmakers to turn the immense potential of recommendations into real-life applications. 